Tuesday 2 September 2014

Importing CFDs in DataStage

To import a CFD:

Procedure
1. Open the Import Meta Data (CFD) dialog box in either of these ways:
> Choose Import > Table Definitions > COBOL File Definitions from the main menu.
> Right-click the Table Definitions folder in the repository tree and select Import Table Definition > COBOL File Definitions from the shortcut menu.

2. In the COBOL file description pathname field, type or browse for the path name where the CFD file is located. The CFD file must reside either on the InfoSphere™ DataStage® client workstation or on a network that is visible from the client workstation. The default capture file extension is *.cfd. (A sample file is sketched after this procedure.)

3. In the Start position field, specify the starting column where the table description begins (the 01 level). The default start position is 8. You can change the start position to any value from 2 to 80. InfoSphere DataStage allows 64 positions from the start position. For example, if the start position is 2, the last valid column position is 65.

4. In the Column comment association field, specify how to associate comment lines with columns in the CFD file. The default is to associate a comment line with the column that follows it.

5. Select the items to import in the Tables list. This list appears after you specify the CFD path name. Click Refresh to refresh the list if necessary. Select a single table by clicking the table name, or select multiple tables by holding down the Ctrl key and clicking the table names. To select all tables, click Select all.

To see a summary description of a table, select the table name and click Details. The Details of dialog box opens, displaying the table name, description, and column names.

6. In the To folder field, specify the name of the repository folder where you want to save the CFD. The default is Table Definitions\COBOL FD\filename. You can change the folder by typing a different name or browsing for one.

7. Click Import. The data from the CFD file is extracted and parsed. If any syntactical or semantic errors are found, the Import Error dialog box opens, allowing you to view and fix the errors, skip the import of the incorrect item, or stop the import process altogether.
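
For reference, here is a minimal sketch of the kind of file the importer expects (the file name and all field names are hypothetical). The 01 levels begin at the default start position, column 8, the comment lines are flagged with an asterisk in column 7, and because the file contains two 01 levels it would show up as two tables, CUSTOMER-REC and ORDER-REC, in the Tables list of step 5:

$ cat customer.cfd
       01  CUSTOMER-REC.
           05  CUST-ID        PIC 9(8).
      * Customer legal name (with the step 4 default, this comment is
      * associated with the column that follows it, CUST-NAME)
           05  CUST-NAME      PIC X(30).
           05  CUST-BALANCE   PIC S9(7)V99 COMP-3.
       01  ORDER-REC.
           05  ORDER-ID       PIC 9(10).
           05  ORDER-QTY      PIC 9(4) COMP.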



COBOL File Definitions in DataStage

COBOL File Definitions contain data description statements in a text file that describe a file format in COBOL terms. You can import CFDs into the InfoSphere™ DataStage® repository directly from a COBOL program. A CFD file can contain multiple table definitions, and can be either a COBOL copybook or a COBOL source program.

Before you import a COBOL FD, be sure it contains valid COBOL syntax. InfoSphere DataStage supports level numbers 02 to 49 and recognizes the following clauses:

OCCURS
OCCURS DEPENDING ON
PICTURE
REDEFINES
SIGN
SYNCHRONIZED
USAGE

The following items are not captured:

Level numbers 66 and 88 (these become comments for the column)
Data element names that are SQL reserved words (see Reserved Words for a list)

At least one 01 level must be defined in a CFD file. The name at the 01 level becomes the default table name in InfoSphere DataStage. Comments must be designated with an asterisk in the column preceding the start position.
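
As an illustration of these rules, a copybook fragment like the following (a hypothetical sketch; all names are invented) imports cleanly: PICTURE, USAGE, REDEFINES and OCCURS DEPENDING ON are captured as column attributes, while the 88 level becomes a comment on the INVOICE-STATUS column:

$ cat invoice.cfd
      * The 88 level below is not captured as a column; it is imported
      * as a comment on INVOICE-STATUS
       01  INVOICE-REC.
           05  INVOICE-ID       PIC 9(10).
           05  INVOICE-STATUS   PIC X.
               88  INV-PAID     VALUE 'P'.
           05  INVOICE-DATE     PIC 9(8).
           05  INVOICE-DATE-X   REDEFINES INVOICE-DATE PIC X(8).
           05  LINE-COUNT       PIC S9(4) COMP.
           05  INVOICE-LINE     OCCURS 1 TO 99 TIMES
                                DEPENDING ON LINE-COUNT.
               10  ITEM-CODE    PIC X(12).
               10  QUANTITY     PIC S9(5) COMP-3.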

For details about the native data types that InfoSphere DataStage supports when importing CFDs, see Native Data Types.



The Unix Time Command: tips & tricks

If you have a program ./prog.e, then in the bash or ksh shell you can type the following command, and the output on the screen details how long the code took to run:

$ time ./prog.e

real    24m10.951s
user    6m2.390s
sys     0m15.705s


Real time - the elapsed (wall clock) time from the beginning to the end of the program; this is the total time of execution.
CPU time - divided into user time and system time.
User time - the time used by the program itself and any library subroutines it calls; this is the time spent processing at the user/application level.
System time - the time used by system calls invoked by the program (directly or indirectly); this is the time spent at the system/kernel level.

If the wall clock time is consistently much longer than the total of the user and system time, the program is probably spending much of its time waiting, for example on data being fetched from or written to disk. In parallel codes, the job may be spending a good deal of time waiting on communication between processors.
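
Taking the sample output above as a worked example: the CPU time is user + sys = 6m2.390s + 0m15.705s, roughly 6m18s, against a wall clock time of about 24m11s, so some three quarters of the run was spent waiting rather than computing. You can see the same effect on a small scale with a command that does almost no work but waits, such as sleep (the figures below are only illustrative and will differ slightly on your machine):

$ time sleep 5

real    0m5.003s
user    0m0.001s
sys     0m0.002s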

This command gives you a quick way to check the performance of your scripts and programs.



Wednesday 26 February 2014

DataSet in DataStage

Inside an InfoSphere DataStage parallel job, data is moved around in data sets. These carry metadata with them, both column definitions and information about the configuration that was in effect when the data set was created. If, for example, you have a stage that limits execution to a subset of the available nodes, and the data set was created by a stage using all nodes, InfoSphere DataStage can detect that the data will need repartitioning.


If required, data sets can be landed as persistent data sets, represented by a Data Set stage. This is the most efficient way of moving data between linked jobs. Persistent data sets are stored in a series of files linked by a control file (note that you should not attempt to manipulate these files using Unix tools such as rm or mv; always use the tools provided with InfoSphere DataStage).

There are two groups of Datasets - persistent and virtual.

The first type, persistent Datasets, are marked with the *.ds extension, while the *.v extension is reserved for the second type, virtual Datasets. (It is worth mentioning that no *.v files are visible in the Unix file system, because virtual Datasets exist only in RAM while a job runs. The *.v extension itself belongs strictly to OSH, the Orchestrate scripting language.)

The further differences are more significant. Primarily, persistent Datasets are stored in Unix files using the internal DataStage EE format, while virtual Datasets are never stored on disk - they exist within links, also in EE format, but only in RAM. Finally, persistent Datasets can be read and rewritten with the Data Set stage, whereas virtual Datasets are simply passed through in memory.

A data set comprises a descriptor file and a number of other files that are added as the data set grows. These files are stored on multiple disks in your system. A data set is organized in terms of partitions and segments.

Each partition of a data set is stored on a single processing node. Each data segment contains all the records written by a single job. So a segment can contain files from many partitions, and a partition has files from many segments.

Firstly, because a single Dataset contains multiple records, all of those records must undergo the same processes and modifications; in other words, they all pass through the same successive stages.
Secondly, different Datasets usually have different schemas, so they cannot be treated interchangeably.

The alias names of Datasets are:

1) Orchestrate File
2) Operating System file

A Dataset also has multiple files. They are:
a) Descriptor File
b) Data File
c) Control file
d) Header Files

In the descriptor file, we can see the schema details and the address of the data.
In the data files, we can see the data itself in native format.
The control and header files reside at the operating system level.
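
If you need to inspect or remove a persistent Dataset from the command line, use the utilities shipped with the parallel engine rather than plain Unix commands (as noted earlier, rm or mv on the descriptor file leaves the data files behind). The sketch below assumes the engine environment has been set up, for example by sourcing dsenv, so that orchadmin and dsrecords are on the PATH; mydata.ds is a hypothetical descriptor file, and the exact sub-commands available can vary by version, so check orchadmin help on your installation:

$ dsrecords mydata.ds              # report the number of records in the Dataset
$ orchadmin describe mydata.ds     # show the schema and the data files behind the descriptor
$ orchadmin dump mydata.ds         # print the records themselves
$ orchadmin rm mydata.ds           # delete the descriptor and all of its data files safely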


Starting a Dataset Manager

Choose Tools ► Data Set Management, and a Browse Files dialog box appears.
Navigate to the directory containing the data set you want to manage. By convention, data set files have the suffix .ds.
Select the data set you want to manage and click OK. The Data Set Viewer appears. From here you can copy or delete the chosen data set. You can also view its schema (column definitions) or the data it contains.