Saturday, 13 October 2012

Quick DataStage Tips


While designing or coding a job of any kind or complexity, a few tips can come in very handy.
They help you design DataStage jobs efficiently, and the more efficient the design, the better the job performs when it runs. These tips also reduce design and coding time and make debugging easier.


Partitioning

Partitioning methods should be chosen carefully in parallel jobs to achieve optimal performance. Repartitioning of data within a job should be avoided as much as possible.
- "Same" partitioning should not be used to read data from a partitioned database if the number of database partitions is much higher than the number of nodes specified in the APT configuration file. "Same" partitioning will override the node layout specified in the APT configuration file and the job will become resource intensive (a sample configuration file follows this list).
- In Lookup stages, if the reference data is significantly smaller in volume than the input dataset, "Entire" partitioning should be used on the reference link and the input link partitioning should be left as "Auto" (default).
- In Lookup stages, if the input and reference data are of comparable volumes, hash partitioning should be used on both links.
- The input dataset to a Remove Duplicates stage should be hash partitioned and sorted on the key used for de-duplication.
- The partitioning of the input dataset to a Join stage should either be left as "Auto" or be hash partitioned and sorted on the key fields used to join the data.
- The partitioning on the input reference link should be set to "Entire" if the lookup type is "Normal" and the volume of reference data is quite low (in the thousands).
- The partitioning on the input and reference links should be "Hash" when the volumes on the two links are comparable.
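For reference, a minimal APT configuration file defining two processing nodes might look like the sketch below. The host name and directory paths are purely illustrative; the point is that the number of node entries here, not the database partitioning, is what defines the job's default degree of parallelism.

    {
      node "node1"
      {
        fastname "etlserver"
        pools ""
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
      }
      node "node2"
      {
        fastname "etlserver"
        pools ""
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
      }
    }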


Sequential File Stage

Particular attention must be paid when importing fields using the Sequential File stage:
- If an incoming column maps to a variable-length data type (e.g. INTEGER, DECIMAL, VARCHAR) but occupies a fixed width in the file, the Field Width column property must be set to match the fixed width of the input column. Double-click on the column number in the grid dialog to set this column property.
- If a field is nullable, you must define the null representation (null field value, null field length) in the "Nullable" section of the column properties. Double-click on the column number in the grid dialog to set these properties.
- Care should be taken when reading delimited, bounded-length VARCHAR columns (varchars with the length option set). If the source file has fields with values longer than the maximum VARCHAR length, these extra characters will be silently discarded.

The following points should be taken into account to get optimal performance from jobs that use Sequential File stages.
- Use multiple-node reads or multiple readers per node to increase the performance of extracts from fixed-width sequential files.
- Add a reject link to capture the records that do not conform to the metadata definitions.
- Avoid reading from sequential files using the Same partitioning method. Unless you have specified more than one source file, this will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless you explicitly repartition.


Database Stages

In general, it is best to use the native parallel database stages (DB2/UDB Enterprise, Oracle Enterprise) whenever possible for maximum parallel performance and features.


DB2 Stages

The DB2/UDB Enterprise (native parallel) stage should be used for reading from, performing lookups against, and writing to a DB2 Enterprise Server Edition database with the Database Partitioning Feature (DPF).
- As a native, parallel component, the DB2/UDB Enterprise stage is designed for maximum performance and scalability against very large partitioned DB2 UNIX databases.
- The DB2/API (plug-in) stage should only be used to read from and write to DB2 databases on other platforms (such as mainframe editions accessed through DB2 Connect), i.e. when the database runs on a different operating system than the ETL server. Sparse lookup is not supported through the DB2/API stage.
- A DataStage job can connect to only one DB2 instance. If both a DB2 Enterprise stage and a DB2 API stage are used in the same job, the instance information from whichever DB2 stage comes first in the job is picked up, so care should be taken that the database used in the later DB2 stage is present in the catalog of that first instance.
- Queries that join data from collocated tables in a multi-partition database should be optimized with the NODENUMBER clause wherever applicable. Queries submitted through the DB2 Enterprise stage bypass the DB2 optimizer, so an unoptimized query can spawn many unnecessary connections across the partitions available in the database (see the sketch below).
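As an illustration only (the table and column names are invented, and this assumes the two tables are collocated on the same partitioning key), a query of this shape keeps each extraction process on its own database partition:

    SELECT a.acct_id, a.balance, c.cust_name
    FROM   account a, customer c
    WHERE  a.cust_id = c.cust_id
    AND    NODENUMBER(a.cust_id) = CURRENT NODE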


Oracle Stage

- Oracle stages in DataStage EE will always use the table definition as defined in the underlying database, regardless of explicit job design metadata (data type, nullability, etc.).
- When building jobs that use Oracle sources or targets, always import Oracle table definitions using the "orchdbutil" option in DataStage Designer/Manager to avoid unexpected data type conversions.
- If the password has a special character in it, then the password should be enclosed within double quotes.


Dataset Stage

- Dataset stages can be used as intermediate stages to land data, and the data stored in a dataset is already partitioned, so reading data back from a dataset is quicker.
- While using datasets created by preceding jobs, care should be taken to ensure that both jobs use the same APT configuration file. If different configuration files are used, the job may not read all of the data present in the dataset.
- All the datasets should be stored in the appropriate "work" directory.
- Datasets created for landing intermediate data that are not needed for the next run of the job should be purged. This releases the disk space occupied by the datasets on the resource disks (see the example below).
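For example (the path and dataset name are hypothetical, and this assumes the orchadmin utility is available with APT_CONFIG_FILE pointing at the configuration file used when the dataset was created), a dataset and its underlying data files can be removed with:

    orchadmin rm /ds/work/stg_customer_tmp.ds

Deleting only the .ds descriptor file with a plain rm leaves the data segments behind on the resource disks, which is why orchadmin (rm/delete) is preferred.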

Processing Stages

This section describes the best practices that can be applied to almost every individual stage that can be used in DataStage EE to transform the input data.

Transformer Stage

Within DataStage EE, the Transformer stage always generates C++ code which is then compiled into a parallel component. For this reason, it is important to minimize the number of Transformers, and to use other stages (Copy, Filter, Switch, etc.) when derivations are not needed.

- Transformer should never be used as a placeholder to ease job maintenance. Instead use the Copy stage as a job placeholder.
- Transformer should never be used just to split the input dataset into multiple output datasets based on the values of certain input columns. Filter stage should be used to do such operations.
- Transformer should never be used just to do null handling operations. Modify stage should be used to perform such operations.
- When evaluating expressions for output derivations or link constraints, the Transformer will discard any row that has a NULL value used in the expression. This is because the rules for arithmetic and string handling of NULL values are by definition undefined.
Always test for null values before using a column in an expression
         Eg. If ISNULL(link.col) Then… Else…
- It may be possible to implement complex derivation expressions that follow regular patterns with a Lookup table instead of a Transformer with nested IF THEN ELSE derivations.
Eg. If A=0,1,2,3 Then B=”X” If A=4,5,6,7 Then B=”C”
- Output derivations are evaluated BEFORE any type conversions on the assignment. Therefore, it is important to make sure type conversion has taken place before a row reaches the Transformer.
Eg. TrimLeadingTrailing(string) works only if string is a VarChar field. Thus, the incoming column must be type VarChar before it is evaluated in the Transformer.
- If constraints are used on all the output links of a Transformer, add a reject link to capture the records that do not match any of the constraints (a small example follows this list).
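As a small illustration (the link and column names are hypothetical), a null-safe output derivation and a link constraint in a Transformer might look like this:

    Derivation for out.net_amount:
        If IsNull(in.discount) Then in.gross_amount Else in.gross_amount - in.discount

    Constraint on the valid-records output link:
        Not(IsNull(in.customer_id))

Rows satisfying none of the constraints would then flow to the reject link mentioned above.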


Stage Variables

Stage variables should be used judiciously, and only when required.
- They can be used when the same output derivation is needed on multiple output links; this prevents the expression from being evaluated multiple times.
- They can be used when the transformation logic requires access to values from preceding records (see the example after this list).
- Care should be taken to use the correct data type when defining stage variables.
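For example (the column and variable names are hypothetical), detecting a change of key against the previous record can be done with two stage variables. They are evaluated top to bottom, so the comparison runs before the previous value is overwritten:

    svIsNewKey:  If in.cust_id = svPrevKey Then 'N' Else 'Y'
    svPrevKey:   in.cust_id

This assumes svPrevKey is given a suitable initial value and that the input is hash partitioned and sorted on cust_id, so that related records arrive on the same node in order.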

Lookup Stage

Lookup stage should be used to join the source data with the reference data, when the volume of data in the reference table is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of physical memory.
- The condition feature can be used to validate the key fields used for the lookup.

- While doing a lookup, if the same column name exists on both the input link and the reference link and you want to map the reference value to the output, the output column should be given a different name; it can then be renamed in a subsequent Copy stage. If the column is NOT renamed, the source column will be mapped to the output column irrespective of the column mapping done in the Lookup stage.


Normal lookup

The volume of data that can be referenced using a normal lookup depends on the memory available on the server. The rule of thumb is to use a normal lookup for volumes of fewer than 1 million records.
- If the reference data is very small compared to the source data, "Entire" partitioning can be used on the reference link. This would increase the performance significantly.
- If the reference data is of roughly the same volume as the source data, "Hash" partitioning should be done on the lookup keys on both the input and reference links.


Sparse Lookup

If the reference to a Lookup is directly from a DB2 or Oracle table, and the number of input rows is significantly smaller (e.g. 1:100 or more) than the number of reference rows, a Sparse Lookup may be appropriate.
- In a sparse lookup, the query in the Lookup stage is executed once for every input record, so the query should be fairly simple. If the query is complicated and time consuming (with inner queries, etc.), a Join stage should be used instead (see the sketch below).
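As a hedged sketch (the table and column names are invented), the user-defined SQL for a sparse lookup against a DB2 or Oracle Enterprise stage typically binds the key from each input record using the ORCHESTRATE.column placeholder:

    SELECT cust_id, cust_name, cust_segment
    FROM   customer_dim
    WHERE  cust_id = ORCHESTRATE.cust_id

Keeping this statement to a simple indexed key lookup is what makes the per-record round trips affordable.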

Join Stage

- The Join stage should never be used for joining a huge volume of source data with a significantly smaller volume of reference data; this would slow down the entire process (a Lookup is the better fit there).
- The Join stage should be used when both the source and reference datasets are huge (in the millions of records).
- If explicit hash partitioning is done on the input links of a Join stage, the data may have to be repartitioned if there is a possibility of skew in the input or reference datasets.
- The Join stage does not provide reject handling for unmatched records (such as in an Inner Join scenario). If unmatched rows must be captured or logged, an OUTER join operation must be performed. In an OUTER join scenario, all rows on an outer link (e.g. Left Outer, Right Outer, or both links in the case of Full Outer) are output regardless of whether the key values match. When a match does not occur, the Join stage inserts NULL values into the unmatched columns, so care must be taken to change the column properties to allow NULL values before the Join. This is most easily done by inserting a Copy stage and mapping the columns from non-nullable to nullable. After the join, the columns from the outer link can be checked for null values and the records having null values can be rejected.
- In some cases, it is simpler to use a Column Generator to add an 'indicator' (DUMMY) column, with a constant value, to each of the outer links and test that column for the constant after performing the join (see the example below).
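For instance (the names are illustrative), if a Column Generator adds DUMMY_IND = 'Y' to the reference link before a left outer join, a downstream Transformer can separate matched and unmatched rows with constraints such as:

    Matched-records link constraint:    lnk.DUMMY_IND = 'Y'
    Unmatched/reject link constraint:   IsNull(lnk.DUMMY_IND)

This works because the Join stage fills DUMMY_IND with NULL for rows that found no match, provided the column is defined as nullable.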

Joining input with reference data (Join vs Lookup)

There are multiple ways of joining the input data with the reference data. The optimal approach should be selected by considering the following criteria:
- Input data volume
- Reference data volume
- Join criteria
- Reject handling criterion
- Performance

- If the input and reference data are collocated in the database, joining the data within the SQL will give the optimal performance.
- In certain cases, if the input data is significantly smaller than the reference data, loading the processed input data into a work table and then joining it with the reference data in the database will perform much better than a DataStage join.

If the data is not collocated or if the join has to happen between processed data then DataStage provides the following options.

Join, Lookup and Merge stages perform the same function of joining reference data with the input data.
- The Join stage can be used to perform inner joins and all types of outer joins. It does not have any volume constraints. It does not have a reject link, so reject handling should be done using lookup columns or dummy columns in subsequent stages. While joining huge volumes of data within DataStage, the scratch disk space required and available, and the other jobs running during that window, should be taken into consideration and planned for accordingly.
- The Lookup stage can be used to perform inner and left outer joins. Normal lookups are in-memory lookups: the reference data is held in system memory, and as the reference volume grows the data spills to paging space, so the performance of normal lookups degrades with increasing volumes. Sparse lookups should be used when the input data is significantly smaller than the reference data (say 1:100); they fire a query against the database for every input record, so they become a performance issue as the number of input records grows.
- The Merge stage can perform inner and left outer joins. The input is treated as the master dataset and all the other reference datasets are treated as update datasets. The advantage of the Merge stage is that it provides an individual reject link for every update dataset. The Merge stage can be used to join data of comparable volumes on the input and reference links.

Aggregator Stage

- The aggregation column types should be defined as double rather than decimal; this should improve performance.
- If the aggregation column types have to be decimal, setting the environment variables APT_DECIMAL_INTERM_PRECISION and APT_DECIMAL_INTERM_SCALE is recommended. When set, they define the precision and scale for intermediate decimal objects; if either variable is not set, the default values DEFAULT_INTERM_PRECISION (precision) and DEFAULT_INTERM_SCALE (scale) are used (an example follows this list).
- Null values in the grouping columns will distort results. Include a transformer before the Aggregator to exclude nulls.
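For example, the two variables can be added to the job as environment variable parameters; the values below are purely illustrative and should be sized to the data being aggregated:

    $APT_DECIMAL_INTERM_PRECISION = 38
    $APT_DECIMAL_INTERM_SCALE = 10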

Copy Stage

- The Copy stage should be used as the placeholder stage, and whenever there is a need to split the input dataset into multiple outputs.
- It should also be used for simple renaming of columns, instead of the more expensive Transformer stage.

Filter Stage

- The Filter stage should be used to split the input dataset into multiple output datasets based on conditions.
- If null values are expected in the columns used in the Where clause, the "Nulls Value" option should be set to define whether null values are treated as 'Greater Than' or 'Less Than' other values.
- Filter stages should be placed as close to the beginning of the job as possible, to avoid processing unwanted records downstream.

Modify stage

- If the source-to-target transformations require just null handling and keeping/dropping of columns, a Modify stage should be used instead of the Transformer stage (a sample specification follows).
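A minimal Modify specification of that kind might look like the lines below, entered as separate specifications in the stage (the column names are hypothetical):

    KEEP cust_id, order_amt, order_date
    order_amt = handle_null(order_amt, 0)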

Remove Duplicate Stage

The Remove Duplicate stage is used to remove duplicate records based on the key fields defined.
- Input data has to be hash partitioned and sorted for the de-duplication logic to be applied properly.
- If the input records are not identical then proper de-duplication logic should be defined to meet the business requirements.
- The sorting and hash partitioning required for the Remove Duplicate stage should be done separately, in a Sort stage preceding it. Performing the hash partitioning and sorting on the input link of the Remove Duplicate stage has been seen to cause intermittent memory issues.

Funnel Stage

The funnel requires all input links to have identical schemas (column names, types, attributes including nullability). The single output link matches the input schema.
- The funnel type should be set to "Continuous Funnel" unless there is a valid reason to specify otherwise; the continuous type performs better than the other types.

Target Stages

This section describes the best practices for the stages that can be used as target stages in a parallel job.

Sequential Files

- If null values are expected for some of the input fields, care should be taken to define the null field values on the Format tab.

DB2 Stage

Two types of DB2 stages (DB2 Enterprise and DB2 API) can be used to load data into DB2 tables. The recommendation is to use the DB2 Enterprise stage to load data into DB2 tables that reside on the same operating system as the ETL server, and the DB2 API stage to load data into DB2 tables on a different operating system, e.g. mainframe DB2 tables.

DB2 Enterprise

- The DB2 Enterprise stage executes in parallel, and the partitioning on the input link of the stage should be left as DB2 (default). Overriding this partitioning type adds to the overhead on the database.
- The Load option can be used in this stage if the requirement is just to insert a huge volume of data into the target table. The load should be a non-recoverable load; otherwise, if the job fails, the entire tablespace will be locked. To make the load non-recoverable, set the 'Non-Recoverable Transactions' option to 'True'.
- The Upsert option can be used to insert/update data in the target table. Care should be taken when updating a table with referential integrity; in that case the array size and the row commit interval should have the same value.
- If the requirement is to commit the records only at the completion of the load, the row commit interval should be set to a very high value.

DB2 API 

- The DB2 API stage runs sequentially, and its throughput is much lower than that of the Enterprise stage (a maximum of around 2000 rows/sec).
- The database used in the DB2 API stage should be cataloged in the default DB2 instance (mentioned in the dsenv file) for the connections to be established properly.
- When using dual-server APT_CONFIG_FILE values (say any *prod01_prod02*.apt), the DB2 client installation should be verified on both servers.

Oracle Stage

- If the password has a special character in it, then the password should be enclosed within double quotes.
- The partitioning of the input link of the Oracle stage should be left as "Auto" (default).
- Jobs that update/insert/delete rows in Oracle tables containing low-cardinality bitmap indexes tend to create database deadlocks. In such scenarios the bitmap indexes should be changed to b-tree indexes.

External targets

If the external target invokes a C program that works with a file larger than 2 GB, call open64() instead of open(). The same can be achieved by using the open() API but compiling with the _LARGE_FILE_API macro defined. A minimal sketch follows.
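The snippet below is only a sketch; the function name and path are placeholders, and it assumes a platform where open64() is exposed (on AIX this is tied to the _LARGE_FILE_API macro mentioned above).

    #include <fcntl.h>

    /* Open an output file for the external target that may grow beyond 2 GB.
       open64() uses 64-bit file offsets, so writes past the 2 GB mark do not fail. */
    int open_target_file(const char *path)
    {
        return open64(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    }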
