Personal tools

Partners

ETL Design Techniques for Incremental Capture

From MIKE2 Methodology

Jump to: navigation, search

ETL Design Techniques for Incremental Capture describe logical techniques for dealing with incremental updates that may be needed as part of the ETL publication of information. Some of the most common options include:

  • Application-Assisted Capture
  • Database Triggers
  • Database Log Scraping
  • Timestamp-Based Capture
  • File Comparison

A high level description of these techniques is shown as follows.

Contents

Application-Assisted Capture

Application-assisted capture calls for the source application to provide logic to detect modifications of interest to the ETL and provide incremental extracts of the modifications. This option usually requires modification to the application, since most applications were not designed to provide incremental feeds to an external system. Source applications which have messaging interfaces or an API for integration fall into this category.

Advantages

  • Provides greatest flexibility and control of the data capture
  • Possibly feasible with new application

Disadvantages

  • Usually extremely complex (and therefore too expensive) to retrofit into a legacy application

Database Triggers

Database triggers implemented in the DBMS of the source application can provide incremental feeds without having to change the application. This can be an effective option if performance of the application is not impacted significantly. It is technically simpler than log scraping (next option).

Advantages

  • Provides complete record of database modifications
  • Does not require changing the source application

Disadvantages

  • Can only be used if the source application uses a DBMS which supports triggers
  • Can be expensive and time-consuming to implement if the underlying table structure of the source system is complex
  • Can negatively impact the source application\’s performance
  • Unpopular with DBAs responsible for OLTP applications
  • Introduces dependency on the DBMS vendor

Database Log Scraping

Oracle, DB2 and other DBMSs provide log scraping options for collecting incremental updates. It provides functionality comparable to triggers without impacting the performance of an OLTP source application. This is frequently the best option if it is available and if the required DBA skill and resources are available and are willing to be allocated by the client.

Advantages

  • Provides complete record of database modifications
  • Does not require changing the source application
  • Minimally impacts the OLTP application\’s performance

Disadvantages

  • Can only be used if the source application uses a DBMS which supports log scraping or if a third-party tool provides the functionality
  • Can be expensive and time-consuming to implement if the underlying table structure of the source system is complex
  • Requires significant DBA skill and resources to analyse the technical requirements, to correctly implement the log scraping and to provide ongoing support
  • May introduce dependency on the DBMS vendor

Timestamp-Based Capture

Capturing data by comparing an application-maintained update timestamp to the timestamp of the previous capture is a method for collecting data which has changed since the previous capture. There is an obvious dependency upon the reliability of the application\’s timestamps. In many cases timestamps will exist in the tables of an application, but they will not be updated at all, or may be updated only in certain cases and therefore may not be suitable for incremental capture. Also, if the application updates the database in stages, incomplete data might be captured. In a rarer case, an application which deletes data will mask the delete from this capture option, along with any modifications performed since the previous capture.

Advantages

  • Provides simplicity of design of incremental capture

Disdvantages

  • Depends upon the reliability and suitability of the application\’s timestamps
  • Could be expensive if the application and its database must be modified to provide suitable timestamps
  • May capture incomplete data if the application updates its database in stages
  • May miss data modifications if the application physically deletes data

File Comparison

If none of the above capture options can be implemented, then the fall-back option is file comparison. That is, the ETL must perform full extracts, and must save the full extract from the previous capture. A comparison between the current and previous full extracts then provides the set of changes. For large sets of data, this implies a great deal of processing. Fortunately, modern ETL tools provide parallel-processing options which bring this option into the realm of possibility.

This option used to require significant ETL design. ETL tools with built-in file comparison functionality often reduce this now. For example, DataStage Enterprise Edition (formerly Parallel Extender) provides a stage which performs just this sort of file comparison with minimal coding. The amount of processing required is still significant, but at least the design and development time for the ETL is reduced.

Advantages

  • Provides incremental capture where no other option is feasible
  • Design can be simple if the ETL tool provides file comparison functionality

Disadvantages

  • Requires a great deal of processing for large sets of data

Could be infeasible if the SLA for update latency is demanding

Powered by omCollab