ETL Best Practices Deliverable Template

From MIKE2.0 Methodology

This article is currently Under Construction. It is undergoing major changes as it is in the early stages of development. Users should help contribute to this article to get it to the point where it is ready for a Peer Review.
This deliverable template is used to describe a sample of the MIKE2.0 Methodology (typically at a task level). More templates are now being added to MIKE2.0 as this has been a frequently requested aspect of the methodology. Contributors are strongly encouraged to assist in this effort.
Deliverable templates are illustrative as opposed to fully representative. Please help add examples to this template that are representative of the proposed output.

When using an off-the-shelf ETL product, the principles of software development do not change: we want our code to be reusable, robust, flexible, and manageable. To assist in the development, a set of ETL Best Practices should be created for the implementation to follow. Failure to implement these practices usually results in problems further down the track, such as a higher cost of future development, increased time spent on administration tasks, and problems with reliability.

Listed below are the standards that should be identified on a typical project:

  • Naming Conventions that will be used across the ETL integration environment.
  • Error Handling: proposed standards for error trapping of jobs. This should be at a standards level, with detail of the design covered explicitly in a separate section of the physical design.
  • Process Reporting: status and row counts of jobs should be retrieved for accurate process reporting.
  • Notification: the manner in which information about successful and unsuccessful runs is delivered to the administrator and relevant stakeholders.
  • Parameter Management: the ability to manage job parameters across environments so that components can be delivered and run without requiring any modifications (see the sketch after this list).
  • Optimisation: standards for improving performance, such as parallelism or hash files. The more detailed design aspects of this approach are covered in a separate section of the physical design.
  • Reusability: standards around simplified design and use of shared components.
  • Metadata Management: standards around metadata management as they apply to the ETL design.
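As an illustration of the parameter management standard above, the sketch below shows one way to keep environment-specific values out of the jobs themselves so that components can be promoted without modification. It is a minimal Python sketch of the principle only; the file layout, section names, and the ETL_ENV variable are hypothetical and do not describe any particular ETL product's parameter facility.

# Minimal sketch of environment-independent parameter management.
# Assumption: each environment (dev, test, prod) has its own parameter
# file; jobs read values by name and never hard-code paths.
import configparser
import os

def load_job_parameters(environment: str, param_dir: str = "./params") -> dict:
    """Load the parameter set for the given environment (e.g. 'dev', 'test', 'prod')."""
    config = configparser.ConfigParser()
    param_file = os.path.join(param_dir, f"{environment}.ini")
    if not config.read(param_file):
        raise FileNotFoundError(f"Parameter file not found: {param_file}")
    # Flatten the sections (e.g. [paths], [connections]) into a single dict.
    params = {}
    for section in config.sections():
        for key, value in config[section].items():
            params[f"{section}.{key}"] = value
    return params

if __name__ == "__main__":
    # The same job code runs unchanged in every environment; only the
    # parameter file differs.
    params = load_job_parameters(os.environ.get("ETL_ENV", "dev"))
    landing_dir = params.get("paths.landing_dir", "/data/landing")
    print(f"Reading source files from {landing_dir}")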


Examples

Example of ETL Best Practices for a vendor tool

DataStage Best Practices

Overview

Like any programming tool, DataStage jobs must follow the principles of the SDLC in the delivery of components. To assist in the development, management and support of the DataStage environment, there is a set of best practices identified by Ascential and the DataStage community that should be implemented. These best practices are additional to a standard DataStage install. Failure to implement these practices usually results in problems further down the track, such as a higher cost of future development, increased time spent on administration tasks, and problems with reliability.

Best Practices

The following best practices have been identified for Phase 1 of the Data Quality project:

  • Naming Conventions: will be in place from the start of the development.
  • Folder Management: DataStage components will be organised into a set of folders for each type of extract and a subset of folders for each type of job.
  • Failover and Recovery: because the loads are not large the strategy in this area will be to restart the extract from the beginning. The standard best practices of milestone points and staging will not be required.
  • Error Handling: will target 100% error trapping of jobs and components called by DataStage.
  • Release Management: will use the DataStage Version Control tool to provide a process to deliver read-only components to testing and production. Components will only be moved to production after first passing through the System Test environment.
  • Process Reporting: status and row counts of jobs will be retrieved for accurate process reporting.
  • Notification: email will be the primary notification vehicle, integrated with the error handling and process reporting routines to ensure information about successful and unsuccessful runs is delivered to the administrator and stakeholders.
  • Parameter Management: the ability to manage job parameters across environments so that components can be delivered and run without requiring any modifications.
  • Optimisation: instantiation will be built in where data is written out to staging files through each part of the process. Multiple instances will not be required in Phase 1 due to the small size of the files; instead, parallelism will be used when different types of extracts run at the same time. Standard optimisation techniques such as hash file lookups will be in place.
  • Simple Design: each job will have up to six stages and be designed to carry out one part of the processing.
  • Metadata Management: DataStage will be the central repository of metadata information on source and final output files. File definitions will be setup using the standard DataStage import table definition functions. Metadata on the addresses processed by QualityStage will be managed by the QualityStage repository and imported into DataStage. Version numbering of metadata will be in place in DataStage to map changes to metadata against jobs affected.
  • Best of breed processing: DataStage will handle those tasks best suited to ETL while QualityStage will handle all validation and standardisation tasks. This will result in a solution which is inherently easier to maintain and support.

Naming Conventions

DataStage components

  • All jobs will include the two-character code "DQ" to indicate they belong to the data quality project.
  • Sequence jobs will be prefixed by "Seq:"
  • Batch jobs will be prefixed by "Batch:"
  • All jobs will have a two-character code identifying which type of extract they belong to, i.e. BP, CP, or DM.
  • After all prefixes, the job name will be a description of the primary job function. Descriptions will be standard across extract types (see the naming sketch after this list).
  • Routines will have the prefix DQ indicating they were written for the data quality project. The rest of the routine name will describe the primary routine function to make it easy to identify in a transformer.
  • Saved Files will retain the same name they had in the source system.
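The convention above can be illustrated mechanically. The following is a hypothetical Python helper that assembles a job name from the job type prefix, the project code DQ, the extract code, and a description; the underscore separators and the function itself are illustrative assumptions, not part of DataStage.

# Hypothetical helper that builds job names following the conventions above:
# optional "Seq:"/"Batch:" prefix, the project code "DQ", a two-character
# extract code, and a descriptive suffix.
VALID_EXTRACTS = {"BP", "CP", "DM"}
JOB_TYPE_PREFIXES = {"server": "", "sequence": "Seq:", "batch": "Batch:"}

def build_job_name(job_type: str, extract: str, description: str) -> str:
    if job_type not in JOB_TYPE_PREFIXES:
        raise ValueError(f"Unknown job type: {job_type}")
    if extract not in VALID_EXTRACTS:
        raise ValueError(f"Unknown extract code: {extract}")
    # Separator and ordering are assumptions for illustration only.
    return f"{JOB_TYPE_PREFIXES[job_type]}DQ_{extract}_{description}"

print(build_job_name("sequence", "BP", "LoadCustomerAddresses"))
# -> Seq:DQ_BP_LoadCustomerAddresses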

Job Stages

A prefix will be used on all stage names. This helps identify the stage when metadata about the jobs is extracted into a customer report:

  • Transformer – xfm
  • Sequential file stage – seq
  • Hash file stage – hsh
  • Complex flat file stage – cff

The rest of the stage name will describe the primary function of the stage.
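As a small, hypothetical illustration, a stage name following this convention could be checked with a pattern such as the one below; the underscore separator and the example names are assumptions.

import re

# Accept stage names such as "xfm_ValidateAddress": a known prefix,
# an underscore, then a description of the stage's primary function.
STAGE_NAME = re.compile(r"^(xfm|seq|hsh|cff)_[A-Za-z0-9]+$")

for name in ["xfm_ValidateAddress", "seq_ReadCustomers", "lookup1"]:
    print(name, "OK" if STAGE_NAME.match(name) else "does not follow the convention")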

File Names

Source files are given names by the processes or Data Analysts who create them. These names are expected to be unique. DataStage will retain these names throughout the processing cycle and add extra identifiers to the names as the file passes through each process stage, as follows (assuming the input file is named customers010108.dat):

Filename – Contents
Customers010108.dat – The original source file; name must be unique.
Init_customers010108.dat – The original file content is updated with a unique identifier for each record; the original filename is prepended with the init_ prefix.
Messages_customers010108.dat – The log file that receives all the trapped error and warning messages from QualityStage, DataStage, and Unix scripts.
Clean_addresses.hsh – A file of clean addresses in hash file format to allow lookup.
Rejects_customers010108.dat – Addresses that are rejected by DataStage due to validation or transformation problems.
Apid_customers010108.dat – The text file created for the APID application.
Apid_stat_customers010108.dat – The APID statistics file showing APID success rates and error rates.
Log_Apid_customers010108.dat – The log file produced by the APID process.
Err_Apid_customers010108.dat – The error rows created by the APID process.
Apid_matches.hsh – The APID output loaded into a hash file for lookups.
Out_customers010108.dat – The output from the APID application containing unique IDs and matching APID codes.
DQ_customers010108.dat – The final file containing clean addresses and APIDs.
Failed_APID_customers010108.dat.txt – The report containing addresses without an APID.
Before_After_customers010108.dat.xml – The before / after report containing all clean and original address information.
Mail.msg – Temporary file that contains the text of an email notification.
Attach.txt – Temporary file containing all the text for an email file attachment.
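The prefixes in the table above are derived from the unique source file name. The sketch below is a hypothetical Python illustration of that derivation for a handful of the prefixes; it is not part of the job design itself.

# Derive the working file names used at each processing stage from the
# unique source file name (only a subset of the prefixes is shown).
def derived_names(source_name: str) -> dict:
    return {
        "source":   source_name,                    # e.g. customers010108.dat
        "init":     f"Init_{source_name}",          # records tagged with unique identifiers
        "messages": f"Messages_{source_name}",      # trapped errors and warnings
        "rejects":  f"Rejects_{source_name}",       # rows rejected during validation
        "output":   f"DQ_{source_name}",            # final clean addresses and APIDs
    }

for label, name in derived_names("customers010108.dat").items():
    print(f"{label:9s} {name}")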

QualityStage components

  • All jobs will include the two-character code (CP, BL, or DM) indicating which source system they belong to: CP - Customer Correspondence, BL - Billing file, DM - Direct Marketing.
  • The three routines within QualityStage are Investigation, Standardization and Matching. Each routine name will be prefixed with INV for investigation processing, STN for standardization processing, or MTC for matching processing. Each routine also has a detailed description field to make it easy to identify.
  • Saved Files will retain the same name they had in the source system.

Job Stages

A prefix will be used on all stage names. This helps identify the stage when metadata about the jobs is extracted into a customer report:

  • Investigation – INV
  • Standardization – STN
  • Matching – MTC

The rest of the stage name will describe the primary function of the stage.

Example of the naming convention used in the Investigation routine

Customer-id
  INVTCUS – Investigate Customer-id (T) pattern
  INVCCUS – Investigate Customer-id (C) content
  INVDCUS – Investigate Customer-id (D) duplicate
Customer Name
  INVWNAM – Investigation Name (W) word
Street Address
  INVWAD1 – Investigation Address1 (W) word
Postcode Address
  INVWAD2 – Investigation Address2 (W) word
Locality Address
  INVWAD3 – Investigation Address3 (W) word
State Address
  INVTSTA – Investigate State (T) pattern
  INVCSTA – Investigate State (C) content
  INVDSTA – Investigate State (D) duplicate

File Names

Source files are given names by the Data Analysts or automated processes that create them. These names are unique. QualityStage will retain these names throughout the processing cycle unless requested to change these names during testing.

Examples of naming standards used in the Standardization and Matching routines:

QualityStage Standardization and Matching Naming Standards

Customer Correspondence file
  CP10030A – Standardization file
  CP10040 – Handled Standardization file
  CP10050 – Unhandled Standardization file (Exception Report)
  CP1MTCHI – CP Individual Match file
  CP1MTCHI.RPT – CP Individual Match report
  CP1MTCHO – CP Organisational Match file
  CP1MTCHO.RPT – CP Organisational Match report

Billing file
  BL0003A – Standardization file
  BL0003C – Handled Standardization file
  BL0003E – Unhandled Standardization file (Exception Report)
  BLMTCHI – BL Individual Match file
  BLMTCHI.RPT – BL Individual Match report
  BLMTCHO – BL Organisational Match file
  BLMTCHO.RPT – BL Organisational Match report

Direct Marketing file
  DM00030A – Standardization file
  DM00088A – Handled Standardization file (with DM business rules)
  DM00050 – Unhandled Standardization file (Exception Report)
  DMMTCHI – DM Individual Match file
  DMMTCHI.RPT – DM Individual Match report
  DMMTCHO – DM Organisational Match file
  DMMTCHO.RPT – DM Organisational Match report

Error Handling

The target is 100% error handling.

Sequence job error handling

  • An exception handling stage will exist in each sequence job to trap all unexpected errors; this stage will trigger notification.
  • Check the return code of server jobs called and trap any non-success statuses.
  • Server jobs will be checked for reject records.
  • Check the return code of Unix scripts. Unix scripts will follow a template where any errors will return a non-zero code to DataStage to flag an error. They will also echo success and failure messages back to the DataStage log for inclusion in notification (a sketch of this pattern follows the list).
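To illustrate the return-code pattern in the last bullet, the sketch below runs an external step, traps a non-zero exit code, and keeps the output for later notification. It is a hypothetical Python illustration of the principle; the script name is an assumption, and the real implementation uses DataStage sequence jobs and Unix script templates rather than this code.

import subprocess

def run_step(command: list, log: list) -> bool:
    """Run one external step; trap a non-zero return code and keep its
    output so it can be included in the notification email."""
    result = subprocess.run(command, capture_output=True, text=True)
    log.append(result.stdout.strip())
    if result.returncode != 0:
        log.append(f"ERROR: {' '.join(command)} returned {result.returncode}: "
                   f"{result.stderr.strip()}")
        return False
    return True

log_messages = []
# "extract_customers.sh" is a hypothetical script name used for illustration.
if not run_step(["/bin/sh", "extract_customers.sh"], log_messages):
    # A failed step triggers the notification routine (sketched later).
    print("Step failed; notification required")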

Server job error handling

  • Errors in the opening and creation of files will be trapped by the calling sequence job.
  • Errors in data will be trapped by a reject link from each transform writing the record out to a reject file.
  • Minimal transformation and validation will be carried out in server jobs, most of the data validation will be performed in QualityStage thereby reducing the risk of DataStage errors.

Release Management

The DataStage Version Control tool is used to deliver components to testing and production. This tool handles all components in a DataStage project and any Unix scripts or applications placed in custom folders under the project directory.

  • Components for delivery are collected into a numbered release.
  • The Version Control tool attaches to the development project and required components are imported into the version control project as a release batch.
  • The batch is then promoted to testing as read-only components. The tool will compile the jobs as they are delivered.
  • Parameters are then updated to reflect the paths on the new system.

Notification

DataStage will be responsible for all communication with support and users. The main form of communication will be emails sent from the xmail application on the DataStage server. The following files are collected and written out in an email message (a sketch of this assembly follows the list):

  • Errors_filename : All error messages logged by DataStage server and sequence jobs. If a job aborts without a trappable error message then a message will be created for the log and this file.
  • Warnings_filename : Contains up to 50 warning messages from each job. If a job has more than 50 warnings it aborts. The full text of the warning is retrieved to make troubleshooting easier.
  • Rejects_filename : This file is sent as an attachment. Any job that has a complex transformer will have a reject link; all these reject links append rows to this file. Since it is of unknown length and width, it is sent as a tab-delimited attachment that can be opened in Excel.
  • Apid_stats_filename : This file contains APID statistics on the number of rows processed and the success of APID processing. Since this information is also available in the row counts, this file is ignored.
  • Links_filename : This file contains important link row counts such as the rows at the beginning of processing, the rows passing APID, rejected row counts and the final output row counts.
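The sketch below illustrates how the files listed above could be assembled into a single notification message. It is a hypothetical Python example using the standard library email module; the addresses and file handles are assumptions, and the project itself uses the xmail application on the DataStage server rather than this code.

import smtplib
from email.message import EmailMessage
from pathlib import Path

def build_notification(run_name: str, errors: Path, warnings: Path, rejects: Path) -> EmailMessage:
    """Assemble one email: error and warning text in the body, rejects as
    a tab-delimited attachment that can be opened in Excel."""
    msg = EmailMessage()
    msg["Subject"] = f"ETL run {run_name}: {'FAILED' if errors.stat().st_size else 'succeeded'}"
    msg["From"] = "etl@example.com"          # hypothetical addresses
    msg["To"] = "support@example.com"
    body = errors.read_text() + "\n" + warnings.read_text()
    msg.set_content(body or "Run completed with no errors or warnings.")
    msg.add_attachment(rejects.read_bytes(), maintype="text",
                       subtype="tab-separated-values", filename=rejects.name)
    return msg

# Delivery step, e.g.:
# smtplib.SMTP("mailhost").send_message(build_notification("customers010108",
#     Path("Errors_customers010108.dat"), Path("Warnings_customers010108.dat"),
#     Path("Rejects_customers010108.dat")))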

Metadata Management

Source file definitions are imported into DataStage via the import function. Imports of flat files will use the headings already appearing in the first row of those files. Imports of COBOL files will use the COBOL definition files provided by DIT.

COBOL definition workaround

The COBOL definition files provided by DIT needed the following amendments so they could be imported and used by DataStage:

  • A 01 entry was added to the start of the file as DataStage requires this numbering to identify a table.
  • All comments were removed as the import function was failing to parse these rows.
  • All 88 commands were removed as these commands are intended for report formatting and altered the raw data when they were parsed.
  • A Unique Id field was added to the start of each row.

The definition files with the above changes were then used to create the table definitions in DataStage, defining the layout of the Billing and Correspondence files.
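A sketch of the kind of clean-up described above is shown below. It is a simplified, hypothetical Python illustration covering only the comment and level-88 removals; real copybooks may need more careful handling, and the 01 entry and Unique Id field amendments are not shown.

def clean_copybook(lines: list) -> list:
    """Drop comment lines and level-88 condition entries from a COBOL
    copybook so the remaining field definitions can be imported."""
    cleaned = []
    for line in lines:
        stripped = line.rstrip("\n")
        # Traditional fixed-format copybooks flag comments with '*' in column 7.
        if len(stripped) >= 7 and stripped[6] == "*":
            continue
        # Level-88 entries are condition names, not data fields.
        if stripped.lstrip().startswith("88 "):
            continue
        cleaned.append(stripped)
    return cleaned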

When a file definition is first imported it is given a version number of v_1 to indicate this is the first version. It retains this number until it reaches testing. If a table definition needs to be changed, for example when the Direct Marketing file receives a new address field for validation, the new format is imported and numbered as v_2. Jobs that use the new field load it using the v_2 definition. It is then possible to extract a metadata report from the Reporting tool that shows all jobs still using the old table definition that need to be updated to the new format.
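The impact report described above can be thought of as a cross-reference between table definition versions and the jobs that load them. The sketch below is a hypothetical Python illustration of that idea; the job and definition names are invented, and the real report comes from the DataStage Reporting tool.

# Hypothetical cross-reference: which jobs still use an old version of a
# table definition and therefore need to be updated to the new format.
jobs_by_definition = {
    "DirectMarketing_v_1": ["Seq:DQ_DM_Load", "DQ_DM_Validate"],
    "DirectMarketing_v_2": ["DQ_DM_ValidateAddress"],
}

def jobs_needing_update(table: str, current_version: int) -> list:
    outdated = []
    for version in range(1, current_version):
        outdated.extend(jobs_by_definition.get(f"{table}_v_{version}", []))
    return outdated

print(jobs_needing_update("DirectMarketing", 2))
# -> ['Seq:DQ_DM_Load', 'DQ_DM_Validate']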

Optimisation

To ensure the shortest possible delivery time of processed files the DataStage controlling sequence job will run in continuous mode waiting for the arrival of files in the Landing directory. This caters for files arriving on a schedule and files delivered ad hoc by Data Analysts.
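The controlling behaviour can be pictured as a simple poll of the Landing directory that starts processing whenever a new file appears. The sketch below is a hypothetical Python illustration of that principle; in the actual design the DataStage controlling sequence job running in continuous mode provides this behaviour, and the directory path and file pattern are assumptions.

import time
from pathlib import Path

def watch_landing(landing_dir: str, poll_seconds: int = 60) -> None:
    """Poll the Landing directory and hand any newly arrived file to the
    processing step, whether it arrives on schedule or ad hoc."""
    seen = set()
    landing = Path(landing_dir)
    while True:
        for source_file in sorted(landing.glob("*.dat")):
            if source_file.name not in seen:
                seen.add(source_file.name)
                print(f"New file detected: {source_file.name}; starting extract")
                # process_extract(source_file)  # hypothetical processing entry point
        time.sleep(poll_seconds)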

Parallel processing will be possible by running more than one type of extract at the same time. For example, a billing extract can run at the same time as a direct marketing extract in the event that the files arrive around the same time. The use of multiple instance jobs, where more than one instance of a billing job can run at the same time, was investigated and deemed an unnecessary development overhead for Phase 1 given the small size of the files.

The following guidelines are used to optimise each job (a sketch follows the list):

  • First transform reduces breadth. Where possible the first transform in any job will only carry through those columns that are needed later in the job. Any columns that are not required are ignored.
  • First transform reduces depth. Where a subset of rows needs to be processed, the first transform in a job, or a complex flat file stage, will attempt to filter out those rows that do not need to be processed to increase the performance of the rest of the job.
  • Unique Id will be used for all look-ups. A single unique id is the fastest way to look-up and match different sets of data.
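As a hedged illustration of the first two guidelines and the unique id look-up rule, the sketch below trims unused columns and filters out unneeded rows in the very first step, and keys its look-up on a single unique id. The column names and filter condition are hypothetical Python assumptions, not the actual job design.

# Hypothetical illustration: the first transform keeps only the columns
# needed downstream and drops rows that do not need processing, and all
# look-ups are keyed on a single unique id.
NEEDED_COLUMNS = ("unique_id", "name", "address", "postcode")

def first_transform(rows):
    for row in rows:                                  # row is a dict read from the source file
        if row.get("record_type") != "ADDRESS":       # reduce depth: skip rows we never process
            continue
        yield {col: row[col] for col in NEEDED_COLUMNS}   # reduce breadth: keep only needed columns

def lookup_clean_address(row, clean_addresses):
    # Reduce look-up cost: match on the single unique id only.
    return clean_addresses.get(row["unique_id"])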