From MIKE2.0 Methodology
Title page
ETL Processing - Software Architecture Document
Introduction
The document provides a comprehensive overview of the software architecture components supporting the ETL processing within the Client Hub environment.
Extract, transform and load (ETL) is a major component within the Client Hub that is responsible for extracting data from source systems, transforming the data into required formats and loading the transformed data into the databases within the Client Hub environment.
Purpose
This document provides an architectural overview of the ETL processing system, using a number of different use cases and architectural views to depict different aspects of the ETL processing within the Client Hub. It is intended to capture and convey the significant architectural decisions which have been made on the system.
Scope
The document focuses on the ETL software components that process and transform data from the source systems and load the processed data into a database of different stages within the Client Hub.
< Examples of different database stages could be
- Pre-landing Database
- Landing Database
- Staging Database
- Target Hub Database
>
Definitions, Acronyms and Abbreviations
Please refer to the Glossary section in the Appendix, which contains a list of the main definitions, acronyms and abbreviations used in this document.
References
The documents referenced in this document can be found in the Appendix A – References section. Please note that some of the referenced documents are still in draft version.
Overview
The document begins with a high-level architectural representation of the overall ETL process and subsequently provides a lower-level detail of each architecture component.
Architectural Representation
The high-level ETL processing architecture shows major system and software components that involve in the movement of source data into the Client Hub. The system components include the source systems that provide data for the Client Hub and the Client Hub which is the target data destination of the ETL processing. The software components, in contrast, are software programs that are used for extracting, transforming and loading of data into the databases within the Client Hub environment. The software components comprise of the source extract programs, ETL processes as well as other supporting software components that complement the ETL processes.
System Components
- Source Systems:
- Client Hub
’ s Databases:
- External Data Providers (if any for data enrichment)
Software Components
- Extract Program: An extract program residing on the source system that is responsible for extracting and providing source data to the Client Hub
- ETL: ETL processes that involve in the data extraction, transformation and loading of data into a database within the Client Hub
- Supporting ETL: Software programs that are auxiliary to the main ETL processes. They are responsible for logging execution status of ETL jobs, processing of exception handling rules and capturing of exception data.
Architectural Goals and Constraints
List the key high level architectural goals and constraints related to the ETL processes, the databases that will be involved etc.
Use-Case View
Examples of the use cases within the client hub
- Receiving data
- Persisting data
- Unique key generation
- Log exceptions
Business Scenarios (Examples)
There are two primary business scenarios related to the MDM Repository ETL Processing:
Initial Load Scenario
This business scenario takes place when the data store within the Client Hub is to be initially loaded with data from the source systems. The scenario may also occur when the database within the Client Hub has to be reloaded if the data is wiped out or if the database has been re-instantiated.
Delta (Incremental) Load Scenario
The Delta Load scenario entails the ETL processing of delta data from the source systems that normally occurs at a regular interval (daily, weekly or monthly). The delta data from the source system is defined as new data records, data records that have been updated or data records that have been deleted in the source systems. Delta data as a result of a change made to the source systems since the previous occurrence of delta data extraction will be processed by the Delta Load ETL Processing scenario.
Sequence Diagram
Sequence diagram illustrating the sequence of use case interactions and system boundaries where a use case is residing.
Functionality
This section describes functionalities of the ETL processing software component. Some of the examples are
Sample functionality is described below
Receive Source Data Functionality
Support for Extracting Required Source Data Elements from Original Extract
Support for Ensuring that No Files are Overwritten Unintentionally
Support for Verifying Completeness of Received Extract Files
Persisting Source Data in the Pre-Landing Database Functionality
Support for persisting or storing the source data records in the Pre-Landing Database
Load Business Data Elements to Pre-Landing Database
Load Technology Data Elements to Pre-Landing Database
Maintain Original Source Data Content
Maintain Original Source Data Format
Support for Tracing Staging Data to its Original in Source Extract
Persisting Source Data in the Landing Database Functionality
Support for persisting or storing the source data records in the Landing Database
Load Business Data Elements to Landing Database
Load Technology Data Elements to Landing Database
Maintain Original Source Data Content
Maintain Original Source Data Format
Support for Tracing Staging Data to its Original in Source Extract
Persist Data in Staging Database Functionality
====Transform Source Data to Staging Database
’ s Required Formats====
Translation of Source Codes to Standardized Codes
Support for Surrogate Key Generation
Logical View
This section should provide a logical view of the system and the database components
System Components
Typically the various landing and staging areas, source systems etc.
Examples: Source Systems: The in-scope source systems.
- Pre-Landing Area: The Pre-Landing Area will store source data as it comes in an extract file. The Pre-Landing Area will be used for storing data from a source system that may not provide extract files in the Client Hub
’ s defined formats.
- Landing Area: The Landing Area will only store Client Hub
’ s required data elements from the source system.
- Staging Area: The Staging Area is a database that is modeled after the hub Conceptual Data Model, which is client-focused and independent of the choice of MDM product that will be deployed in the Target Hub. At this stage, the same data domain (i.e., person, business, account, etc.) from a different source system will be persisted in the same table in the Staging database.
- Target Hub: The Target Hub in logical view is the database of a chosen MDM product.
- Cross Reference: The Cross Reference system component is a database component for storing cross references profile data residing in the source system to the profile data that is stored in the MDM repository. The key data element in the Cross Reference is the source system key of profile data in the source system and the unqiue ID generated by a Client Hub process, in the Client Hub.
Software Components
The software components typically include the major ETL processes: Examples are
- Flow 1A: Source-to-Pre-Landing ETL
The Source-to-Pre-Landing ETL is a data movement process that reads source data files and loads the data into the Pre-Landing database. There should be no transformation on the raw data during this process. However, due to the record structure and the format of the data from certain system (e.g., mainframe) that may have sub records or repeating values and the source data types that may not be supported by the RDBMS in the distributed environment, the Source-to-Pre-Landing ETL may apply data transformation rules to the extent that the source data could be loaded into the Pre-Landing database.
- Flow 1B: Pre-Landing-to-Landing ETL
The Pre-Landing-to-Landing ETL is a data movement process that performs column filtering rules to remove unnecessary data fields from the Pre-Landing database and load only required data fields into the Landing database. Note that in this ETL software component, another filtering rule based on a list of branch numbers may be applied to facilitate the testing allowing the test team to look at smaller, more manageable volume of data.
- Flow 2: Landing-to-Staging ETL
The Landing-to-Staging ETL is a data movement process that performs extensive data transformation for populating the database in the Staging Area. It reads source data that is residing in the Landing Area
’ s database. The following classifications of data transformations occur in the Landing-to-Staging ETL:
- Validations: Validate source data for invalid data content or formats.
- Business Rules: Apply business rules on source data.
- Transformation/Translations: Apply data transformation rules on source data or translate source codes, types or indicators into standardized values.
- Enrichment: Extend, enhance or improve data with data quality rules.
- Correlation: Apply a defined logic for correlating pieces of data to establish a relationship between data.
- Flow 4: ETL to support interfaces to the external data providers
Process View
The process view of the ETL processing architecture reveals ETL processes that support the initial as well as the delta ETL processing requirements.
Example: Source-to-Landing ETL
The logical view of the Source-to-Landing ETL assumes that the source data files from source systems contain data content and formats that meet the requirements of the Client Hub. In the process view of the Flow-1 or the Source-to-Landing ETL, however, reveals the fact that the requirements may not be met by certain source systems. For example, the files may have different file formats or contain many more data fields than needed by the Client Hub.
In such case, the Source-to-Landing ETL can be broken down into two ETL processes:
- Flow 1A: Source-to-Pre-Landing ETL Process
- Flow 1B: Pre-Landing-to-Landing ETL Process
Major Processing Steps:
- FTP source data files from the source system to a designated file system area on the ETL server.
- Apply transformation rules to create load-ready files containing unfiltered data into Pre-Landing Area database. In this step, transformation rules are minimal and may include adding audit fields.
- FTP the load-ready files from the ETL server to a designated file system area on the database server where the Pre-Landing Database resides.
- Load data in the load-ready files into the Pre-Landing Database.
Deployment View
This deployment view of the ETL processing architecture provides a level of details as to how ETL architecture components reside in the source systems, the ETL environment as well as in the Client Hub environment.
Sample ETL Processing Deployment View
Implementation View
Overview (Example)
Major Processing Steps and Design Specifications:
| Step
| Description
| Design & Program Spec
|
| 1A.1
|
|
|
|
|
|
FTP Source Data Files to ETL Server
- Program Logic Overview
- Program Specifications
| Program
|
|
| Name
|
|
| Overview
|
|
| Dependencies
|
|
| Parameters
|
|
| Program Steps
|
|
| Inputs
|
|
| Outputs
|
|
| Lookup/Hash
|
|
| Before Logic
|
|
| After Logic
|
|
| Related Documents
|
|
The following exception-handling rules defined for this program.
|
#
| Exception Conditions/Rules
| Handling Rules
| Exception Code*
| Exception Information to be Logged
|
|
|
|
|
|
|
Job Sequence and Operational View
Job Sequence Diagram of Source-to-Pre-Landing ETL Process (illustrative Example)
Data View
The section provides a detail of each data architectural components that are relevant to the ETL Processing
Examples : Source file names; details of the relevant data models etc.
| Source System File Name (Logical)
| Description
| Source System Physical File Name (DSN)
|
|
|
|
|
|
|
|
|
|
|
|
|
Size and Performance
Appendix A – References (sample)
Standards, Guidelines and Best Practices
MDM Software Requirements Specification (SRS)
Identification Management SRS
Client Hub Services
Client Hub Infrastructure
Data Model
Source Extract Layouts
Data Mappings
Deployment Information of Software Components
The section is for informational purposes only.
ETL Environment
| Component Name
| Development
| QA
| Acceptance
| Production
|
| Software Components
|
|
|
|
|
|
|
|
|
|
|
| Files
|
|
|
|
|
|
|
|
|
|
|
Services Environment
| Component Name
| Development
| QA
| Acceptance
| Production
|
| Software Components
|
|
|
|
|
| Message Queue
|
|
|
|
|
Database Environment
| Component Name
| Development
| QA
| Acceptance
| Production
|
| Database Components
|
|
|
|
|
| Files
|
|
|
|
|
Key Trigger File Locations
| Key Name
| Staging Field Name
| Location
|
|
|
|
|
Exception Classification and Exception Codes (example)
| Category
| Exception Name
| Code
|
| Process
| Authorization Exception
| PRC-01
|
| Data
| Data Audit Exception
| DAT-01
|