From MIKE2 Methodology
ETL Architecture Standards provide general guidelines on capabilties that must be offered as part of the solution to meet common problems faced when conducting data integration. ETL Architecture Standards complement Data Integration High Level Solution Architecture Options.
Although these standards are listed as taking place in ETL Physical Design, it is ideal that they be done before the prototype if possible. Once they are established once, they should be able to be re-used for future increments and only need to be reviewed.
Key Standards to Establish for the ETL Architecture
Design of Rollback and Recovery Procedures
Rollback and Recovery Procedures define the strategy for handling load failures. This includes recommendations on whether milestone points and staging are required for restarts.
The concept of rollback and recovery of ETL processes needs to be considered during the preliminary design stage as it affects all ETL jobs. For long ETL processing blocks such as a data warehouse load the end to end process may take hours or even days to run. It is important to have built in restartability to recover from fatal errors during the processing cycle.
Milestone Recovery
The simplest form of recovery is to introduce vertical bands into the ETL processing cycle. A vertical band is a set of jobs that run together and when complete they arrive at a milestone point. If the following block of jobs fail it can be restarted from this milestone.
A milestone point requires some type of data staging, with the data being placed in temporary files or tables where they can be extracted by the next block.
The best practice for complex data loads is to have 3 vertical bands: Data Sourcing, Data Mediation & Quality/Transformation and Data Load.
- Data Sourcing involves retrieving the data from sources and delivering it to files or tables with some type of time stamping to allow for time based processing. This data undergoes as little transformation as possible and once delivered it is stable. This milestone point means the Data Transformation block can be started and restarted without having the source data affected by user transactions.
- Data Mediation & Quality/Transformation covers a large area of data conversion, cleansing, enrichment, aggregation, etc. This step benefits from having ETL patterns that describe common transformations as shown in the sections below.
- Data Load involves delivering the final data to the target database. This band holds as little of the transformation logic as possible. It is focussed on achieving a robust database update by controlling transaction sizing and trapping database rejects. Database updates are the most volatile part of the process due to the complexity of RDBMS communications and the difficulty most ETL engines have with correctly rolling back and restarting a failed update.
This approach also applies to Services Oriented Architecture implementations of the integration solution.
Individual Job Recovery
In many instances it is possible to recover from a fatal error by restarting the job that failed and continuing the ETL cycle from that point. Many ETL and scheduling tools provide the functionality to automate this process. It may be necessary for production support to first investigate the problem and fix it before the automated recovery begins. This is a short cut to restarting from previous milestone points.
It is usually easy to do job recovery on the Sourcing and Transformation bands as these typically stage the data to temporary files or tables. Restarting an individual job recreates the target output of these jobs without rollback problems. In these cases the ETL scheduling tool can be used to restart the sequence from the correct point.
The Data Load band is the most difficult for rollback and recovery as a job may fail in the process of updating a database. If the update is an insert, or an update to an aggregated table then it is difficult to determine how many rows of stage data have already been processed. A simple job restart may result in duplicate rows or duplicate increases to aggregate results. For a Data Load job it may be possible to restart the job or it may be necessary to build full table rollback into a job restart. It is worth considering enhancing the design to assist with rollback – for example, a batch number could be added to transaction tables to facilitate deletion of partial or erroneous insertions.
ETL Security Design
ETL Security Design defines security provisions which affect design. The ETL security approach needs to take into account the security of the environments and the security of the data being processed.
Environment Security
This section encompasses users, groups and passwords. The security scheme of the ETL tool must be examined and it should be noted that some tools by default use the security of the ETL server operating system for user IDs and passwords.
- If the client machines are on the same domain as the server it is possible for users to omit user ID and password details and pass through the client user ID details to the server for verification. This is more standard for Windows client machines running against Windows ETL servers.
- For Unix, Linux and Mainframe ETL servers the groups and user IDs on the server are likely to be maintained separately from the client domain.
The environment security approach needs to define whether ETL developers and support staff use individually maintained user IDs on the server or whether generic user IDs are used. Generic user IDs are easier to maintain but have a lower level of security.
In Production environments, an ETL batch user ID is required for execution of jobs. This must be investigated early in the ETL design process, since some clients have never had such a concept and it requires education of and negotiation with the client\’s security department.
Database Security
ETL development often has databases as source and targets. Where these databases exist on the ETL server, the database security can be set to the security of the operating system. The security groups on the ETL server can be granted permissions to the databases.
More often, the database is remote, and a special batch user ID and password is required to access the database. The password for this user needs to be protected and must not appear in ETL logs or environment scripts in plain English. The password must appear as encrypted text in password entry screens and password storage and reporting. Typically the database connection details are set by a parameter processing routine which reads a highly protected file to obtain them.
Data Security
The ETL server is used to move large volumes of data from source to target. Most ETL tools provide browse data facilities for development and debugging. This means there is very little inherent data level security on the ETL server – the server itself needs to be kept secure.
If an ETL server has security on all files landed to disk, and row and column level security on all databases, it still cannot be assumed the data is secure. ETL jobs have a powerful range of data manipulation and delivery functions and it must be assumed that any ETL developer can browse and view any data being processed on the server, regardless of the level of security on the sources and targets.
Relationship to Overall Implementation Guide
The following tasks from the Overall Implementation Guide act as input to the definiton of ETL Environment Standards:
The output of this step is then a set of updated ETL Architecture Standards