Thursday 18 October 2012

Basics Of UNIX

The purpose of this post is to have a single page of frequently used basic commands for getting started with UNIX.

Basic UNIX Command Line (shell) navigation:

Directories:
Moving around the file system: 
Listing directory contents: 
Changing file permissions and attributes:
Moving, renaming, and copying files: 
Viewing and editing files:


Directories:

File and directory paths in UNIX use the forward slash "/" to separate directory names in a path.

examples:
/                   the "root" directory
/usr                directory usr (sub-directory of the / "root" directory)
/usr/STRIM100       STRIM100 is a sub-directory of /usr



Moving around the file system:



pwd                  Show the "present working directory", or current directory.
cd                   Change current directory to your HOME directory.
cd /usr/STRIM100     Change current directory to /usr/STRIM100.
cd INIT              Change current directory to INIT, which is a sub-directory of the current directory.
cd ..                Change current directory to the parent directory of the current directory.
cd $STRMWORK         Change current directory to the directory defined by the environment variable 'STRMWORK'.
cd ~bob              Change current directory to user bob's home directory (if you have permission).



Listing directory contents:

ls           list a directory
ls -l        list a directory in long (detailed) format



for example:
$ ls -l
drwxr-xr-x    4 cliff    user        1024 Jun 18 09:40 WAITRON_EARNINGS
-rw-r--r--    1 cliff    user      767392 Jun  6 14:28 scanlib.tar.gz
^ ^  ^  ^     ^   ^       ^           ^      ^    ^      ^
| |  |  |     |   |       |           |      |    |      | 
| |  |  |     | owner   group       size   date  time    name
| |  |  |     number of links to file or directory contents
| |  |  permissions for world
| |  permissions for members of group
| permissions for owner of file: r = read, w = write, x = execute -=no permission
type of file: - = normal file, d=directory, l = symbolic link, and others...

ls -a        List the current directory including hidden files. Hidden files start
             with "."
ls -ld *     List all the file and directory names in the current directory using
             long format. Without the "d" option, ls would list the contents
             of any sub-directory of the current directory. With the "d" option, ls
             just lists them like regular files.

Changing file permissions and attributes:

chmod 755 filename     Changes the permissions of file to be rwx for the owner, and rx for
                       the group and the world. (7 = rwx = 111 binary. 5 = r-x = 101 binary)
chgrp user filename    Makes file belong to the group user.
chown cliff filename   Makes cliff the owner of file.
chown -R cliff dir     Makes cliff the owner of dir and everything in its directory tree.


You must be the owner of the file/directory or be root before you can do any of these things.
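
For example, a short illustrative session (the file name, group name, and listing details are made up) that makes a script executable by everyone but writable only by its owner:

$ ls -l backup.sh
-rw-r--r--   1 cliff   user    512 Jun 18 09:40 backup.sh
$ chmod 755 backup.sh        # owner: rwx, group and world: r-x
$ ls -l backup.sh
-rwxr-xr-x   1 cliff   user    512 Jun 18 09:40 backup.sh
$ chgrp staff backup.sh      # the file now belongs to the group "staff"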


Moving, renaming, and copying files:


cp file1 file2          copy a file
mv file1 newname        move or rename a file
mv file1 ~/AAA/         move file1 into sub-directory AAA in your home directory.
rm file1 [file2 ...]    remove or delete a file
rm -r dir1 [dir2...]    recursively remove a directory and its contents BE CAREFUL!
mkdir dir1 [dir2...]    create directories
mkdir -p dirpath        create the directory dirpath, including all implied directories in the path.
rmdir dir1 [dir2...]    remove an empty directory
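
For instance, a short illustrative session (the directory and file names are made up) that builds a nested directory tree in one step and copies a file into it:

$ mkdir -p projects/2012/october        # creates projects, projects/2012 and projects/2012/october
$ cp notes.txt projects/2012/october/   # copy notes.txt into the new directory
$ mv notes.txt notes.bak                # rename the original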


Viewing and editing files:

cat filename        Dump a file to the screen in ASCII.
more filename       Progressively dump a file to the screen: ENTER = one line down,
                    SPACEBAR = page down, q = quit.
less filename       Like more, but you can use Page-Up too. Not on all systems.
vi filename         Edit a file using the vi editor. All UNIX systems will have vi in some form.
emacs filename      Edit a file using the emacs editor. Not all systems will have emacs.
head filename       Show the first few lines of a file.
head -n filename    Show the first n lines of a file.
tail filename       Show the last few lines of a file.
tail -n filename    Show the last n lines of a file.
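
As a quick illustration (the log file name is made up), n is replaced by an actual number of lines; head -n 20 is the more portable modern spelling of the same thing:

$ head -20 install.log     # show the first 20 lines of install.log
$ tail -20 install.log     # show the last 20 lines of install.log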



Wednesday 17 October 2012

Performance Tuning in IBM InfoSphere DataStage

Performance is a key factor in the success of any data warehousing project. Optimization and performance should be considered from the inception of the design and development process. Ideally, a DataStage® job should process large volumes of data within a short period of time. For maximum throughput and performance, a well-performing infrastructure is required; otherwise, the tuning of DataStage® jobs will not make much of a difference.

One of the primary steps of performance tuning is to examine the end-to-end process flow within a DataStage® job and understand which steps in the job consume the most time and resources. This can be done in several ways:


1. The job score shows the generated processes, operator combinations, data sets, framework-inserted sorts, buffers and partitions in the job. The score can be generated by setting the APT_DUMP_SCORE environment variable to TRUE before running the job (a shell sketch of enabling the score dump follows this list). It also provides information about the node-operator combination. A score dump can help detect redundant operators, so the job design can be modified to remove them.

2. The job monitor can be accessed through IBM InfoSphere DataStage® Director. It provides a snapshot of a job's performance (data distribution/skew across partitions, CPU utilization) at runtime. APT_MONITOR_TIME and APT_MONITOR_SIZE are the two environment variables that control the operation of the job monitor, which takes a snapshot every five seconds by default. This can be changed by changing the value of APT_MONITOR_TIME.

3. Performance Analysis, a new capability beginning in DataStage® 8.x, can be used to collect information, generate reports and view detailed charts about job timeline, record throughput, CPU utilization, job memory utilization and physical machine utilization (shows processes other than the DataStage® activity running on the machine). This is very useful in identifying the bottlenecks during a job’s execution. Performance Analysis can be enabled through a job property on the execution tab, which collects data at runtime. (Note: By default, this option is disabled)

4. Resource Estimation, a toolbar option available in DataStage® 8.x, can be used to determine the system requirements needed to execute a particular job based on varying source data volumes and/or to analyze whether the current infrastructure can support the jobs that have been created.
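
For item 1 above, a minimal sketch of enabling the score dump from the shell, assuming the dsjob command-line client is available and the variable is exported in the environment the job actually runs in (in practice it is more often defined as a project-level or job-level environment variable in the Administrator client); the project and job names are hypothetical:

export APT_DUMP_SCORE=True                      # ask the engine to write the job score to the job log
dsjob -run -jobstatus MyProject MyParallelJob   # run the (hypothetical) job and wait for its status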


There are several key aspects that could affect the job performance and these should be taken into consideration during the job design:

• Parallel configuration files allow the degree of parallelism and resources used by parallel jobs to be set dynamically at runtime. Multiple configuration files should be used to optimize overall throughput and to match job characteristics to available hardware resources.
• A DataStage® job should not be overloaded with stages. Each additional stage in a job reduces the resources available for the other stages in that job, which affects the job performance.
• Columns that are not needed should not be propagated through the stages and jobs. Unused columns make each row transfer from one stage to the next more expensive. Removing these columns minimizes memory usage and optimizes buffering.
• Runtime column propagation (RCP) should be disabled in jobs to avoid unnecessary column propagation.
• By setting the $OSH_PRINT_SCHEMAS environment variable, we can verify that runtime schemas match the job column definitions. Avoid unnecessary data type conversions.
• Proper partitioning significantly improves overall job performance. Record counts per partition can be displayed by setting the environment variable $APT_RECORD_COUNTS (a shell sketch of this and related variables follows this list). Ideally, these counts should be approximately equal. Partitioning should be set so as to ensure an even data flow across all partitions, and data skew should be minimized. If business rules dictate otherwise, repartitioning should be done as early as possible to achieve a more balanced distribution, which will improve the performance of downstream stages.
• DataStage® attempts to combine stages (operators) into a single process, and operator combination is intended to improve overall performance and reduce resource usage. Avoid repartitioning and use ‘Same’ partitioning for operator combination to occur. However, in some circumstances, operator combination may negatively impact performance and in such cases, all the combinations should be disabled by setting $APT_DISABLE_COMBINATION=TRUE.
• Do not sort the data unless necessary. Sorts done on a database (using ORDER BY clause) are usually much faster than those done in DataStage®. Hence, sort the data when reading from the database if possible instead of using the Sort Stage or sorting on the input link.
• Sort order and partitioning are preserved in parallel datasets. If data has already been partitioned and sorted on a set of key columns, check the "Don't sort, previously sorted" option for the key columns in the Sort Stage. When reading from these data sets, partitioning and sorting can be maintained by using the 'Same' partitioning method.
• Datasets store data in native internal format (no conversion overhead) and preserve partitioning and sort order. They are parallelized unlike sequential files and hence, are much faster. Datasets must therefore be used to land intermediate results in a set of linked jobs.
• Use Join Stage as opposed to Lookup for handling huge volumes of data. Lookup is most appropriate when the reference data is small enough to fit into available physical memory. Sparse lookup is appropriate if the driver to the Lookup is significantly smaller than the reference input (1:100).
• Avoid using multiple Transformer Stages when the functionality could be incorporated into a single stage. Use Copy, Filter, or Modify stages instead of Transformer for simple transformation functions like renaming or dropping columns, type conversions, filtering out based on certain constraints, mapping a single input link to multiple output links, etc.
• As much as possible, minimize the number of stage variables in a Transformer Stage as that affects performance, and also avoid unnecessary function calls.
• If existing Transformer-based jobs do not meet performance requirements and a complex reusable logic needs to be incorporated in the job, consider building your own custom stage.
• Data should not be read from Sequential files using ‘Same’ partitioning.
• Sequential files can be read in parallel by using the ‘Multiple readers per node’ option. Setting $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE environment variables may also improve performance of Sequential files on heavily loaded file servers.
• Use Hash method in Aggregators only when the number of distinct key column values is small. A Sort method Aggregator should be used when the number of distinct key values is large or unknown.

• SQL statements in Database stages can be tuned for performance. Appropriate indexes on tables guarantee a better performance of DataStage® queries.
• ‘Array Size’ and ‘Record Count’ numerical values in Database stages can be tuned for faster inserts and updates. Default values are usually very low and may not be optimal.
• The best choice of database stage is to use Connector stages, for maximum parallel performance and functionality.
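
Several of the points above are controlled through environment variables. A minimal sketch of exporting them for a test run is shown below; the values and the configuration-file path are purely illustrative, and in practice these are usually set per project or per job in the Administrator client:

export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/4node.apt   # hypothetical 4-node configuration file
export APT_RECORD_COUNTS=TRUE          # print record counts per partition to check for skew
export OSH_PRINT_SCHEMAS=TRUE          # print runtime schemas so they can be compared with the job column definitions
export APT_DISABLE_COMBINATION=TRUE    # only if operator combination is suspected of hurting performance
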
Conclusion: Performance issues can be avoided by following the above best practices and performance guidelines. The 'Performance Analysis' and 'Resource Estimation' capabilities can be used to gather detailed performance-related data and assist with more complicated scenarios.



Special Thanks to These References:
1. Information Server Documentation – Parallel Job Advanced Developer Guide
2. http://www.element61.be/e/resourc-detail.asp?ResourceId=188
3. http://datastagedeveloper.blogspot.com/2008/01/datastage-performance-tuning.html

Monitoring Datastage Jobs

The Monitor window in Datastage Director -

The DataStage job monitor is accessible through DataStage Director.
The Monitor option appears when you right-click a job name in the Director client,
or you can select the job in the Director list window and go to Tools --> View Monitor.


This displays summary information about the relevant stages in a job that is being run or validated. It has a tree structure containing the stages in a job and their associated links. For server jobs, only active stages are shown (active stages are those that perform processing rather than reading or writing a data source). For parallel jobs, all stages are shown.

You can display more information about a stage and set the server update interval.

If you are monitoring a parallel job, and have not chosen to view instance information, the monitor displays information for Parallel jobs as follows:

  • If a stage is running in parallel, then x N is appended to the stage name, where N gives how many instances are running.
  • If a stage is running in parallel then the Num Rows column shows the total number of rows processed by all instances. The Rows/sec is derived from this value and shows the total throughput of all instances.
  • If a stage is running in parallel then the %CP may be more than 100 if there are multiple CPUs on the server. For example, on a machine with four CPUs, %CP could be as high as 400 where a stage is occupying 100% of each of the four processors; on the other hand, if the stage is occupying only 25% of each processor, the %CP would be 100.
The monitor window displays several columns, which are described below:

Stage name:
This column displays the names of stages that perform processing (for example, Transformer stages). Stages that represent data sources or data marts are not displayed.

Link Type:
When you have selected a link in the tree, displays the type of link as follows:
  • <<Pri: primary input link
  • <Ref: reference input link
  • >Out: output link
  • >Rej: output link for rejected rows
Status:
The status of the stage. The possible states are:
  • Aborted. The process finished abnormally at this stage.
  • Finished. All data has been processed by the stage.
  • Ready. The stage is ready to process the data.
  • Running. Data is being processed by the stage.
  • Starting. The processing is starting.
  • Stopped. The processing was stopped manually at this stage.
  • Waiting. The stage is waiting to start processing.
Num rows:
This column displays the number of rows of data processed so far by each stage on its primary input.

Started at:
This column shows the time that processing started on the server.

Elapsed time:
This column shows the elapsed time since processing of the stage began.

Rows/sec:
This column displays the number of rows that are processed per second.

%CP:
The percentage of CPU the stage is using. (You can turn the display of this column on and off from the shortcut menu.)




Parallel Job Instance Information:
 To monitor instances of parallel jobs individually, choose Show Instances from the shortcut menu. The monitor window then shows each instance of a stage as a sub-branch under the 'parent' stage, and displays the information for all stage instances under that 'parent' stage. Only the relevant information is shown for each stage instance.




Summary:

The job monitor provides a useful snapshot of a job's performance at a moment of execution, but does not provide thorough performance metrics. That is, a job monitor snapshot should not be used in place of a full run of the job, or a run with a sample set of data. Due to buffering and to some job semantics, a snapshot image of the flow might not be a representative sample of the performance over the course of the entire job.

The CPU summary information provided by the job monitor is useful as a first approximation of where time is being spent in the flow. However, it does not include any sorts or similar operators that might be inserted automatically in a parallel job. For these components, the score dump can be of assistance. See "Score Dumps".

A worst-case scenario occurs when a job flow reads from a data set, and passes immediately to a sort on a link. The job will appear to hang, when, in fact, rows are being read from the data set and passed to the sort.


The operation of the job monitor is controlled by two environment variables: APT_MONITOR_TIME and APT_MONITOR_SIZE. By default the job monitor takes a snapshot every five seconds. You can alter the time interval by changing the value of APT_MONITOR_TIME, or you can have the monitor generate a new snapshot every N rows by following this procedure (a shell sketch follows the procedure):

1. Select APT_MONITOR_TIME on the InfoSphere DataStage Administrator environment variable dialog box, and press the 'Set to Default' button.
2. Select APT_MONITOR_SIZE and set the required number of rows as the value for this variable.
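
As an alternative to the Administrator dialog, the same variables can be exported in the shell environment that launches the job; a minimal sketch (the snapshot values are illustrative only):

export APT_MONITOR_TIME=30        # take a monitor snapshot every 30 seconds instead of the default 5
# or, to switch to size-based snapshots as per the procedure above:
export APT_MONITOR_SIZE=100000    # take a snapshot every 100,000 rows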


Tuesday 16 October 2012

ETL vs ELT


One among the many useful articles by Vincent McBurney about comparison between ETL and ELT. 


Every now and then I come across a blog entry that reminds me there are people out there who know a lot more about my niche than I do! This is fortunate as this week it has helped me understand ELT tools.

ETL versus ELT and ETLT

The world of data integration has its own Coke versus Pepsi challenge - it's called ETL versus ELT. Not as exciting as Aliens versus Predator, not as compelling as Ali versus Frazier and not as sexy as Erin Brockovich versus whatever company that was ... but an important battle if you are in charge of a multi-million-dollar IT budget.

ETL (Extract, Transform and Load) is the Coca-Cola in the challenge, with Informatica and DataStage the champions in terms of license fees and market share. It is made up of software that transforms and migrates data on most platforms with or without source and target databases. Business Objects, SAS, Microsoft SSIS, Ab Initio and Cognos are also in the ETL corner.

ELT (Extract, Load and Transform) is the challenger and is now largely driven by RDBMS vendor Oracle with Oracle Warehouse Builder and Sunopsis. It consists of software that transforms and migrates data in a database engine, often by generating SQL statements and procedures and moving data between tables.

ELT technology was constrained by what the database was capable of, and since many products came from RDBMS vendors they tended to be suitable for just one database platform, e.g. Oracle Warehouse Builder and Microsoft DTS. They were also lacking functionality, as the vendor was more concerned with building a database than an ELT tool. Sunopsis was an exception as an ELT tool not owned by an RDBMS vendor (until Oracle acquired them).

Informatica has recently moved into the ETLT (Extract, Transform, Load and Transform) area with database pushdown optimization. This is standard ETL delivering to a target database, with some extra sexy moves done moving it into more tables. Microsoft SSIS also has good ETLT capabilities within the SQL Server database.

Pros of each
I haven't had a lot of experience with ELT products, but fortunately Dan Lindstedt from the B-Eye-Network blogs has been talking about this topic for years now, and his recent entry ETL, ELT - Challenges and Metadata has a great comparison. Here is his pros-of-each list below; visit his blog for further discussion and the cons of each tool:

Pros:

* ETL can balance the workload / share the workload with the RDBMS
* ETL can perform more complex operations in single data flow diagrams (data maps)
* ETL can scale with separate hardware.
* ETL can handle Partitioning and parallelism independent of the data model, database layout,         and source data model architecture.
* ETL can process data in-stream, as it transfers from source to target
* ETL does not require co-location of data sets in order to do its work.
* ETL captures huge amounts of metadata lineage today.
* ETL can run on SMP or MPP hardware.


I would add data quality to this list. The ETL tools have a head start over ELT in terms of data quality integration, with Informatica and DataStage integrating closely with data quality products. The row-by-row processing method of ETL works well with third-party products such as data quality or business rule engines.


Pros of ELT.

* ELT leverages RDBMS engine hardware for scalability
* ELT keeps all data in the RDBMS all the time
* ELT is parallelized according to the data set, and disk I/O is usually optimized at the engine level for faster throughput.
* ELT Scales as long as the hardware and RDBMS engine can continue to scale.
* ELT can achieve 3x to 4x the throughput rates on an appropriately tuned MPP RDBMS platform.


I'm not sure whether that final point refers to data that is being processed from one RDBMS table to another on the same server or whether it also applies to cross-database migrations. If you are moving between database platforms, ELT may need to transfer the data and save it before it commences transformation, whereas ETL can transfer and transform in one stream.
Another pro of ELT is that once the data is on the target platform it no longer places stress on the network; all further transformation is done on the RDBMS server. These pros of ELT are why ETL tools have an ELT capability, whether the simple interface of DataStage in running user-defined SQL and procedures or the advanced capability of Informatica to generate SQL transformation code.


Informatica Pushdown ELT
In 2006 Informatica made an each way bet in the challenge by putting ELT style push down optimization into the PowerCenter ETL tool. You can see a very good list of 8.1.1 and Pushdown optimization best practices at the ETL/ELT forum.

Sunopsis ELT
You can see Sunopsis ELT in action at Mark Rittman's Sunopsis Data Conductor: Creating an Oracle Project, and in Moving Global Electronics Data using Sunopsis he evaluates the Oracle SQL generated.


DataStage ELT
In DataStage, ELT is available through running a SQL statement from a database stage or by calling a stored procedure. DataStage can generate SQL statements but not SQL procedural code, so it is not as flexible or advanced as PowerCenter or Sunopsis. IBM may add this functionality within the next 18 months. They already have a DB2 ELT tool in the DB2 Data Warehouse Edition, in the form of a SQL language generator for populating a data warehouse.


The Wrap
I am not convinced advanced ELT is good in an ETL tool, as it increases upgraditis.
Upgraditis afflicts ETL tools when they struggle to keep up with all the upgrades in the products they integrate with. Good ETL tools have native access to most common database engines and ERP systems. They constantly need to upgrade and certify these items against new versions of those products. When Informatica added pushdown optimization they bought into a whole new area of upgraditis and will have to keep the pushdown compatible with new versions of all the databases it supports.

Datastage Execution Flow


When you execute a job, the generated OSH and the contents of the configuration file ($APT_CONFIG_FILE) are used to compose a "score". This is similar to a SQL query optimization plan.


At runtime, IBM InfoSphere DataStage identifies the degree of parallelism and
node assignments for each operator, and inserts sorts and partitioners as
needed to ensure correct results. It also defines the connection topology (virtual
data sets/links) between adjacent operators/stages, and inserts buffer operators
to prevent deadlocks (for example, in fork-joins). It also defines the number of
actual OS processes. Multiple operators/stages are combined within a single OS
process as appropriate, to improve performance and optimize resource
requirements.


The job score is used to fork processes with communication interconnects for data, messages and control. Processing begins after the job score and processes are created. Job processing ends when either the last row of data is processed by the final operator, a fatal error is encountered by any operator, or the job is halted by DataStage job control or by human intervention such as DataStage Director STOP.


Job scores are divided into two sections — data sets (partitioning and collecting)
and operators (node/operator mapping). Both sections identify sequential or
parallel processing.




The execution (orchestra) manages control and message flow across processes, and consists of the conductor node and one or more processing nodes. Actual data flows from player to player; the conductor and section leaders are only used to control process execution through control and message channels.


# Conductor is the initial framework process. It creates the Section Leader (SL)
processes (one per node), consolidates messages to the DataStage log, and
manages orderly shutdown. The Conductor node has the start-up process.
The Conductor also communicates with the players.

Note: You can direct the score to a job log by setting $APT_DUMP_SCORE.
To identify the Score dump, look for “main program: This step....”.

# Section Leader is a process that forks player processes (one per stage) and
manages up/down communications. SLs communicate between the
conductor and player processes only. For a given parallel configuration file,
one section leader will be started for each logical node.

# Players are the actual processes associated with the stages. They send stderr
and stdout to the SL, establish connections to other players for data flow,
and clean up on completion. Each player has to be able to communicate
with every other player. There are separate communication channels
(pathways) for control, errors, messages and data. The data channel does
not go through the section leader/conductor, as this would limit scalability (see the sketch after this list).

Data flows directly from upstream operator to downstream operator.
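
One way to observe this process hierarchy: while a parallel job is running, the conductor, section leader and player processes all appear as osh processes on the engine host(s). A minimal sketch, assuming shell access to the server (the grep pattern is only illustrative):

ps -ef | grep osh | grep -v grep    # lists the conductor, section leader(s) and player processes while the job runs
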
========================================================================

Summarizing the above concept:



Datastage Job Execution: The Orchestra


# Conductor - Initial process
- Composes the Score
- Creates Section Leader process (one/node)
- Consolidates messages to DataStage log.
- Manages orderly shutdown

# Section Leader (one per Node)
- Forks Player process (one per stage)
- Manages up/down communication

# Players - The actual processes associated with stages
- Combined players: one process only
- Sends stderr, stdout to Section Leader
- Establish connections to other players for data flow.
- Clean up upon completion.

The Conductor node has the start-up process. It creates the score and starts up the section leaders.
Section leaders communicate with the conductor only; the Conductor communicates with the players.





Runtime Control and Data Networks:

Every player has to be able to communicate with every other player. There are separate communication channels (pathways) for control, messages, errors, and data. Note that the data channel does not go through the section leader/conductor, as this would limit scalability. Data flows directly from upstream operator to downstream operator using the APT_Communicator class.






Saturday 13 October 2012

Quick Datastage Tips


While designing or coding a job of any kind or complexity, a few tips can be very handy.
They help in designing DataStage jobs efficiently: the more efficient the design, the better the job will perform once executed. These handy tips also reduce job design/coding time and support a better debugging approach.


Partitioning

Partitioning methods should be used widely in parallel jobs to achieve optimal performance.  Repartitioning of data within a job should be avoided as much as possible.
- Same partitioning should not be used to read data from a partitioned database, if the number of partitions in the database is very high compared to the number of nodes specified in the APT configuration file. “Same” partitioning will override the partitioning specified in the APT configuration file and the job will become resource intensive.
- In Lookup stages, if the reference data is significantly smaller in volume than the input dataset, then "Entire" partitioning should be used on the reference link and the input link partitioning should be left as "Auto" (default).
- In Lookup stages if the input & reference data are of comparable volumes, then hash partitioning should be used in both the links
- The input dataset to a Remove Duplicate stage should be hash partitioned and sorted on the key used for deduplication
- The partitioning of the input dataset to a Join stage should be either left as “Auto” or should be hashed & sorted on the key fields used to join the data.
- The partitioning on the input reference links should be set to "Entire" if the lookup type is "Normal" and the volume of reference data is significantly low (in the thousands).
- The partitioning in the input and reference links should be “Hash” when the volume is about equal in both the links


Sequential File Stage

Particular attention must be paid when importing fields using the Sequential File stage:
- If the incoming columns are variable-length data types (eg. INTEGER, DECIMAL, VARCHAR), the field width column property must be set to match the fixed-width of the input column. Double-click on the column number in the grid dialog to set this column property.
- If a field is nullable, you must define the null representation (null field value, null field length) in the “Nullable” section of the column property. Double-click on the column number in the grid dialog to set these properties.
- Care should be taken when reading delimited, bounded-length VARCHAR columns (varchars with the length option set). If the source file has fields with values longer than the maximum VARCHAR length, these extra characters will be silently discarded.

The following points should be taken into account to get an optimal performance in jobs using sequential file stages.
- Use multiple node reads or multiple readers per node to increase the performance of extracts from fixed width sequential files
- Add a reject link to capture the records that do not conform to the metadata definitions
- Avoid reading from sequential files using the Same partitioning method. Unless you have specified more than one source file, this will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless you explicitly repartition.


Database Stages

In general, it is best to use the native parallel database stages (DB2/UDB Enterprise, Oracle Enterprise) whenever possible for maximum parallel performance and features.


DB2 Stages

The DB2/UDB Enterprise (native parallel) stage should be used for reading from, performing lookups against, and writing to a DB2 Enterprise Server Edition database with Database Partitioning Feature
- As a native, parallel component, the DB2/UDB Enterprise stage is designed for maximum performance and scalability against very large partitioned DB2 UNIX databases.
- The DB2/API (plug-in) stage should only be used to read from and write to DB2 databases on other platforms (such as mainframe editions through DB2 Connect). The DB2 API stage should be used to access data from a DB2 database installed on a different operating system than the ETL server. Sparse Lookup is not supported through the DB2/API stage.
- A DataStage job can connect to only one database instance and if both DB2 Enterprise stage and DB2 API stages are used in the same jobs then the instance information provided in the DB2 stage (DB2API or DB2 ENT) that comes first in the job will be picked up. Hence care should be taken to make sure that the database used in the later DB2 stage is present in the catalog of the first instance.
- Queries that involve joining data from collocated tables in a multi-partition database should be optimized by using the NODENUMBER clause wherever applicable. The queries submitted through the DB2 Enterprise stage bypass the DB2 optimizer, and hence if the query is not optimized the extraction process will spawn too many unnecessary connections across the partitions available in the database.


Oracle Stage

- Oracle stages in DataStage EE will always use table definition as defined in the underlying database, regardless of explicit job design metadata (Data Type, Nullability, etc)
- When building jobs that use Oracle sources or targets, always import Oracle table definitions using the “orchdbutil option” in DataStage Designer/Manager to avoid unexpected data type conversions.
- If the password has a special character in it, then the password should be enclosed within double quotes.


Dataset Stage

- Dataset stages can be used as intermediate stages to land data and the data stored in the dataset is partitioned. So when accessing the data from a dataset the read process will be quicker
- While using datasets created by preceding jobs, care should be taken to ensure that the same APT configuration file is used by both the jobs. If different configuration files are used the job may not read the entire data present in the dataset.
- All the datasets should be stored in the appropriate “work” directory
- Datasets created for landing intermediate data that are not needed for the next run of the job should be purged (a sketch of one way to do this from the shell follows this list). This releases the disk space occupied by the datasets and increases the available scratch disk space.
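
A minimal sketch of purging an obsolete dataset, assuming the orchadmin utility shipped with the parallel engine is on the PATH and $APT_CONFIG_FILE points to the same configuration file the job used (the dataset path is made up; check orchadmin's help for the exact sub-commands in your release). Deleting through orchadmin removes the data segments on every node, whereas an ordinary rm would only remove the small .ds descriptor file:

orchadmin delete /work/project/intermediate_customers.ds   # removes the descriptor and the data segments on all nodes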

Processing Stages

This section describes the best practices that can be applied to almost every individual stage that can be used in DataStage EE to transform the input data.

Transformer Stage

Within DataStage EE, the Transformer stage always generates “C” code which is then compiled to a parallel component. For this reason, it is important to minimize the number of transformers, and to use other stages (Copy, Filter, Switch, etc) when derivations are not needed.

- Transformer should never be used as a placeholder to ease job maintenance. Instead use the Copy stage as a job placeholder.
- Transformer should never be used just to split the input dataset into multiple output datasets based on the values of certain input columns. Filter stage should be used to do such operations.
- Transformer should never be used just to do null handling operations. Modify stage should be used to perform such operations.
- When evaluating expressions for output derivations or link constraints, the Transformer will discard any row that has a NULL value used in the expression. This is because the rules for arithmetic and string handling of NULL values are by definition undefined.
Always test for null values before using a column in an expression
         Eg. If ISNULL(link.col) Then… Else…
- It may be possible to implement complex derivation expressions using regular patterns by Lookup tables instead of using a Transformer with nested IF THEN ELSE derivations.
Eg. If A=0,1,2,3 Then B=”X” If A=4,5,6,7 Then B=”C”
- Output derivations are evaluated BEFORE any type conversions on the assignment. Therefore, it is important to make sure type conversion has taken place before a row reaches the Transformer.
Eg. TrimLeadingTrailing(string) works only if string is a VarChar field. Thus, the incoming column must be type VarChar before it is evaluated in the Transformer.
- If constraints are used in all the output links of a Transformer, then add a reject link to capture the records that do not match the constraints.


Stage Variables

Stage variables should be used wisely, and only when required:
- It can be used when the same output derivation is used in multiple output links. This would eliminate the expression being executed multiple times.
- It can be used when the transformation logic involves accessing the values of preceding records
- Care should be taken to use the correct data type when defining stage variables

Lookup Stage

Lookup stage should be used to join the source data with the reference data, when the volume of data in the reference table is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of physical memory.
- The condition feature could be used to validate the key fields used for the lookup

- While doing a lookup, if you have the same column name on the input link and the reference link and you want to map the reference value to the output column, then the output column should be given a different name. The column can then be renamed using a subsequent Copy stage. If the column is NOT renamed, the source column will be mapped to the output column irrespective of the column mapping done in the Lookup stage.

The following diagrams represent the situation described above






Normal lookup

The volume of data that can be referenced using a normal lookup depends on the memory available on the server. The rule of thumb is to use a normal lookup for volumes of fewer than 1 million records.
- If the reference data is very small compared to the source data, then "Entire" partitioning can be used on the reference link. This would increase the performance significantly.
- If the reference data is almost equal to the source data then “Hash” partitioning should be done on the lookup keys on both the input & reference links


Sparse Lookup

If the reference to a Lookup is directly from a DB2 or Oracle table, and the number of input rows is significantly smaller (e.g. 1:100 or more) than the number of reference rows, a Sparse Lookup may be appropriate.
- In Sparse lookups the query in the Lookup stage will be executed as many times as there are input records, so the query in the Lookup stage should be fairly simple. If the query in the Lookup stage is complicated and time consuming (with inner queries, etc.) then a Join stage should be used.

Join Stage

- Join stage should never be used for joining a huge volume of source data with a significantly smaller volume of reference data. This would slow down the entire process.
- Join stage should be used when data in both source & reference datasets are huge (in millions)
- If explicit hash partitioning is done in the input links of a join stage, then data has to be repartitioned if there is a possibility of skew in the input/reference datasets.
- Join stage does not provide reject handling for unmatched records (such as in an InnerJoin scenario). If un-matched rows must be captured or logged, an OUTER join operation must be performed. In an OUTER join scenario, all rows on an outer link (eg. Left Outer, Right Outer, or both links in the case of Full Outer) are output regardless of match on key values. During an Outer Join, when a match does not occur, the Join stage inserts NULL values into the unmatched columns. Care must be taken to change the column properties to allow NULL values before the Join. This is most easily done by inserting a Copy stage and mapping a column from NON-NULLABLE to NULLABLE.  After the joining operation the columns from the outer link can be checked for null values and the records having null values can be rejected.
- In some cases, it is simpler to use a Column Generator to add an ‘indicator’ (DUMMY) column, with a constant value, to each of the outer links and test that column for the constant after performing the join.

Joining input with reference data (Join vs Lookup)

There are multiple ways of joining the input data with the reference data. The optimal way of referring data should be selected by considering the following criteria
- Input data volume
- Reference data volume
- Join criteria
- Reject handling criterion
- Performance

- If the input and reference data is collocated in the database, joining the data within the SQL will give the optimal performance.
- In certain cases, if the input data is significantly smaller than the reference data, then loading the input (processed) data into a work table and joining it with the reference data in the database will give much better performance than DataStage joins.

If the data is not collocated or if the join has to happen between processed data then DataStage provides the following options.

Join, Lookup and Merge stages perform the same function of joining reference data with the input data.
- Join stage can be used to perform inner and all types of outer joins. It does not have any volume constraints. It does not have a reject link, and hence reject handling should be done by using some lookup columns or dummy columns in subsequent stages. While joining huge volumes of data within DataStage, the scratch disk space required/available and the other jobs running during that window should be taken into consideration and planned for accordingly.
- Lookup stage can be used to perform inner and left outer joins. Normal lookups are in-memory lookups: the reference data is stored in system memory, and if the reference volume increases the data spills to paging space, so with increasing volumes the performance of normal lookups becomes poor. Sparse lookups should be used when the input data is significantly smaller than the reference data (say 1:100). A sparse lookup fires a query to the database for every input record, and hence becomes a performance issue as the number of input records increases.
- Using Merge stage we can perform inner and left outer joins. The input is considered as Master dataset and all the other reference datasets are considered as Update datasets. The advantage of using a Merge stage is having individual reject links for every Update dataset. Merge stage can be used to join data of comparable volumes in both input and reference datasets.

Aggregator Stage

- The aggregation field types should be defined as double instead of decimal types. This should improve the performance.
- If the aggregation field types have to be decimal, then setting the variables APT_DECIMAL_INTERM_PRECISION and APT_DECIMAL_INTERM_SCALE is recommended. When set, they define the precision and scale for intermediate decimal objects; if either variable is not set, the default values DEFAULT_INTERM_PRECISION (precision) and DEFAULT_INTERM_SCALE (scale) are used (see the sketch after this list).
- Null values in the grouping columns will distort results. Include a transformer before the Aggregator to exclude nulls.
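
A minimal sketch of setting the intermediate-decimal variables from the shell; the precision and scale values here are purely illustrative, and in practice these would normally be added as project- or job-level environment variables:

export APT_DECIMAL_INTERM_PRECISION=38   # illustrative precision for intermediate decimal results
export APT_DECIMAL_INTERM_SCALE=10       # illustrative scale for intermediate decimal results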

Copy Stage

- Copy Stage should be used as the placeholder stage whenever there is a need for splitting the input dataset into multiple outputs
- For simple renaming of columns, a Copy stage should be used instead of the more expensive Transformer stage.

Filter Stage

- Filter stage should be used to split the input dataset in multiple output datasets.
- If null values are expected as values for the columns used in the Where Clause, “Nulls Value” option should be set to define whether null values are treated as 'Greater Than' or 'Less Than' other values.
- Filter stages should be moved to the beginning of the job as much as possible to avoid processing unwanted records.

Modify stage

- If the Source to Target transformations require just null handling and keeping/dropping of columns then a Modify stage should be used instead of using the Transformer stage.

Remove Duplicate Stage
Remove Duplicate stage is used to remove the duplicate records based on the key fields defined.
- Input data has to be hash partitioned and sorted to apply the de-duplication logic properly.
- If the input records are not identical then proper de-duplication logic should be defined to meet the business requirements.
- The sorting and hash partitioning required for the Remove Duplicate stage should be done separately in a Sort stage preceding it. Performing the hash partitioning and sorting on the input link of the Remove Duplicate stage can cause intermittent memory issues.

Funnel Stage

The funnel requires all input links to have identical schemas (column names, types, attributes including nullability). The single output link matches the input schema.
- The type of the funnel should be defined as "Continuous Funnel" unless there is a valid reason to specify otherwise. The Continuous Funnel type performs better than the other types.

Target Stages

This section describes the best practices for all the stages that can be used as a target stage in a parallel job.

Sequential Files

- If Null values are expected as values for some of the input fields then care should be taken to define the Null Values in the formatting tab

DB2 Stage

Two types of DB2 stages (DB2 Enterprise & DB2 API) can be used to load data into any DB2 tables. The recommendation is to use DB2 Enterprise stage to load data into DB2 tables that reside in the same operating system as that of the ETL server. DB2 API stage should be used to load data into DB2 tables that are in a different operating system. E.g. Mainframe DB2 tables

DB2 Enterprise

- DB2 Enterprise stage will execute in parallel and the partitioning in the input link of the stage should be left as DB2 (default). Overriding this partitioning type will add to the overhead of the database
- The Load option can be used in this stage if the requirement is just to insert a huge volume of data into the target table. The load should be a non-recoverable load; otherwise, if the job fails, the entire table space will be locked. To make the load non-recoverable, set the option 'Non-Recoverable Transactions' to 'True'.
- The Upsert option can be used to insert/update data into the target table. Care should be taken when updating a table with referential integrity; in this case the array size and row commit interval should have the same value.
- If the requirement is to commit the records only at the completion of the load, then the row commit interval should be set to a very high value.

DB2 API 

- The DB2 API stage runs sequentially and its throughput will be much lower than the Enterprise stage (a maximum of around 2000 rows/sec).
- The Database used in the DB2 API stage should be cataloged in the default DB2 instance (mentioned in the dsenv file) for the connections to be established properly
- When using a dual-server APT_CONFIG_FILE (say any *prod01_prod02*.apt), the DB2 client installation should be verified on both servers.

Oracle Stage

- If the password has a special character in it, then the password should be enclosed within double quotes.
- The partitioning of the input link of the Oracle stage should be left as “Auto” (default)
- Jobs that update/insert/delete rows on Oracle tables that contain bitmap indexes with low cardinality tend to create database deadlocks. In such scenarios the bitmap indexes should be changed to b-tree indexes.

External targets

If the external target invokes a C program that interacts with a file larger than 2 GB, call open64() instead of open(). The same can be achieved by using the open() API but compiling with the _LARGE_FILE_API macro defined.

Read more ...>>