Data Integration Logical Design
From MIKE2.0 Methodology
Activity: ETL Logical Design
Objective. The ETL Logical Design provides the key design approach for integrating data from the source to target environment. It is technology independent and is complemented by the ETL Physical Design which defines the implementation using a vendor integration technology.
The term ETL is used in a logical sense in this design approach; the tasks below could also be applied to middleware that is more focused on real-time integration.
As ETL technologies advance and continue to move to more model-driven development, more of the tasks below will be done within a software tool. They will also more directly involve interactions with Data Profiling, Data Re-Engineering and Metadata Management tools. Many vendors have already moved in this direction. Many clients, however, are still more comfortable with more traditional forms of design documentation (MS Word, Excel) so the team may still need to employ more traditional design strategies. Regardless of the approach, the tasks below generally define the key areas of the design process.
The term "Metadata Repository" is used in this document to mean the collection of metadata as it pertains to ETL. Historically, metadata was captured in spreadsheets and Word documents. Ideally, design artefacts would be stored in a metadata repository that is part of an integration suite or a centralised hub.
Many of these tasks can be performed in parallel – they are more representative of the key aspects to be covered in the logical ETL design than a stepwise list of tasks to be performed. Logical design may be broken up into several documents (often by information subject area) or ideally in a centralised design repository.
Some of the tasks related to ETL Logical and Physical Design reference a leading guide on this subject that can be used in a complementary fashion with the MIKE2 Methodology.
The ETL Logical Design will oftentimes be in Word document form, and may be delivered in multiple documents on a per-subject-area basis. The recommended approach is to try to move away from static documents and into the use of more active metadata repositories for holding design assets.
Task: Prepare for ETL Design
Task: Define Overall Process Flow
Objective: The Overall Process Flow provides the high-level view of the workflow process that is used to provide integration of data between systems. The visual process flow output of this task gives users a quick description of the flow of information that occurs. This is generally a very simple process flow for data integration scenarios such as Data Warehouses or Data Migrations. The process flow shouldn't be focused on showing an end-to-end flow; it should show a source-to-target flow. In a Data Warehouse that uses an ODS architecture feeding into a set of de-normalised Data Marts, for example, the ETL process that loads into the ODS and the one that loads into the Data Marts will be broken up into two separate process flows.
The Overall Process Flow is ideally built in a software development tool as opposed to a static document.
Task: Define Producer Acquisition Processes
Objective: In this task, the logical design is specified in relation to extracting data from source systems. The steps within this task are:
Step: Define Staging Area
Objective: The initial steps are taken for defining a staging area, which may be needed for loading data extracts.
Step: Determine Producer Extract
Objective: Each source is described, including its technology for information storage and integration. For relational database sources, it is fairly simple to describe the database tables. For other types of sources, there needs to be a complete description of the sources and/or references to other documents which contain this information.
Step: Determine Producer Event
Objective: This step defines the triggers that initiate extraction of information from source systems. It links directly into the overall process flow. Data Acquisition events may be time or event-based; in most data integration implementations (e.g. migrations, warehouses) data acquisition is still done in a batch-oriented fashion although there continues to be more of a move to event-oriented integration.
For many data integration projects, data is typically moved into a staging area immediately after acquisition. Loading data into staging avoids the need for source systems to be accessed multiple times, minimises the number of extracts required from data sources, provides a single version of data for multiple uses, and ensures consistency of data to the target environment.
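The extract-once pattern described above can be sketched as follows. This is a minimal illustration, not part of the methodology; the table and field names are invented for the example.

```python
# Hypothetical sketch: extract once into a staging area, then serve multiple
# downstream loads from the same staged snapshot, so the source system is
# accessed only once per cycle and all consumers see identical data.

def extract_from_source():
    # Stand-in for a real source query (e.g. a SELECT over the source DB).
    return [
        {"cust_id": 1, "name": "Acme", "country": "AU"},
        {"cust_id": 2, "name": "Globex", "country": "NZ"},
    ]

staging = {}

def load_staging(table, rows):
    # One extract per source per cycle; consumers read staging, not the source.
    staging[table] = list(rows)

load_staging("stg_customer", extract_from_source())

# Two consumers (e.g. an ODS load and a data mart load) reuse the snapshot.
ods_rows = [r for r in staging["stg_customer"]]
mart_rows = [r["name"] for r in staging["stg_customer"]]
```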
Step: Determine Extract Frequencies
Objective: As well as the event that initiates the extract, the extract frequencies must all be defined. The Business Requirements should state the expected frequency of extracts from each source; using a BusinessTime model may help provide an easy way to illustrate the frequencies of all loads across the environment.
Step: Define Incremental Data Capture Options
Objective: If incremental updates are required, capture options for its acquisition are investigated and at least a tentative selection is made. Design is performed in the Physical Design phase.
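One common incremental-capture option can be sketched as a high-water mark on a last-updated timestamp. The column names are assumptions for illustration; log-based change data capture or database triggers are equally valid selections at this stage.

```python
# Illustrative high-water-mark incremental capture: only rows changed since
# the previous extract are pulled, and the mark advances for the next run.

def incremental_extract(rows, last_high_water_mark):
    """Return rows changed since the previous extract, plus the new mark."""
    changed = [r for r in rows if r["updated_at"] > last_high_water_mark]
    new_mark = max((r["updated_at"] for r in changed), default=last_high_water_mark)
    return changed, new_mark

source = [
    {"id": 1, "updated_at": "2023-01-01"},
    {"id": 2, "updated_at": "2023-01-05"},
    {"id": 3, "updated_at": "2023-01-09"},
]

delta, mark = incremental_extract(source, "2023-01-03")
# delta contains ids 2 and 3; the mark advances to "2023-01-09"
```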
Step: Define Producer Integration Technologies
Objective: Integration logic applies to the manner of interfacing into the source or target environments. In many data integration environments, this is reasonably straightforward as extracts and load processes will use SQL and will be batch-oriented. For some legacy systems without a relational data store, SQL is not an integration option; for some projects involving integration to application data stores, integration logic may involve adapter integration or an application API. Therefore, this task varies in complexity in relation to the integration points.
The need to use synchronous or asynchronous interfaces is also defined at this stage.
Task: Define Data Mapping of Producer to Consumer
Objective: The purpose of this task is to develop a detailed mapping of the producer system data elements to the consumer data element structure. This task involves creating a matrix that identifies the necessary source system elements and maps them to the database elements defined for the increment. Data Profiling tools can help define the structures of the source and target environment and ideally the source and target definitions and transformation rules will be stored in a metadata repository as opposed to a spreadsheet-based matrix.
There are 4 steps to this task:
Step: Map Producer Data Elements to Consumer Data Elements
Objective: Source data is mapped to target data. Some data may be mappable 1:1. Some producer data elements may need to be combined to create a consumer data element.
In a simple project sources could be loaded directly to targets. In a complex project, there could be multiple mapping segments, for example:
Ideally, the mapping would go into a metadata repository as opposed to a static document such as a spreadsheet.
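The mapping matrix described above can be held as a machine-readable specification rather than a spreadsheet. The sketch below is one possible shape; field names and the combine rule are hypothetical.

```python
# Source-to-target mapping held as data. Each consumer element maps from one
# or more producer elements, with a rule describing how they combine.

mapping = {
    "customer_name": {"sources": ["first_name", "last_name"],
                      "rule": lambda f, l: f + " " + l},   # two elements combined
    "customer_id":   {"sources": ["cust_no"],
                      "rule": lambda v: v},                # direct 1:1 mapping
}

def apply_mapping(row, mapping):
    """Build a consumer row from a producer row using the mapping spec."""
    return {target: spec["rule"](*(row[s] for s in spec["sources"]))
            for target, spec in mapping.items()}

producer_row = {"cust_no": 42, "first_name": "Ada", "last_name": "Lovelace"}
consumer_row = apply_mapping(producer_row, mapping)
```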
Step: Refine Required Producer System Data
Objective: Often, only selected source data is required to be loaded into the target environment. It is preferable to filter out the unnecessary data during data extraction from the source. The filtering criteria rules are defined in this step.
Step: Define Business Rule Transformations
Objective: Business rule transformations describe changes required for business reasons. These are based upon the content and meaning of the data, not just the format. Profiling tools can be valuable for helping to discover business rules.
Business rule transformations can include the following types of processing:
Business rules go hand in hand with transformation rules in that they are defined within the integration layer and implemented in loading from source to the target environment. Whereas transformation rules apply more to the mapping of fields, business rules generally refer to procedures that should be followed based on the content of the data. As with transformation rules, profiling tools can be valuable for helping to discover business rules, and these business rules should ideally be stored in a metadata repository.
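A business rule transformation acts on the content and meaning of the data, as described above. The rule and threshold below are invented purely to show the shape of such a transformation.

```python
# Hypothetical business rule: derive a customer segment from the content of
# the data (annual spend), not from its format. Threshold is illustrative.

def classify_account(row):
    """Apply a content-based business rule during the load."""
    out = dict(row)
    out["segment"] = "key_account" if out["annual_spend"] >= 100_000 else "standard"
    return out

big = classify_account({"cust": "A", "annual_spend": 250_000})
small = classify_account({"cust": "B", "annual_spend": 5_000})
```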
Step: Define Technical Transformations
Objective: Technical transformations describe any changes required for technical reasons rather than business reasons, such as formatting, data lengths, standardisation, etc. Profiling tools can be valuable for helping to discover data value lists, quality issues, etc.
Technical transformations can include the following types of processing:
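Typical format-level changes of this kind can be sketched as below. The specific fields, lengths, and date formats are assumptions for illustration only.

```python
# Technical (format-level) transformations: trimming, length enforcement,
# code standardisation, and date reformatting. No business meaning changes.

from datetime import datetime

def technical_transform(row):
    out = dict(row)
    out["name"] = out["name"].strip()[:30]        # trim and enforce max length
    out["country"] = out["country"].upper()       # standardise code values
    out["load_date"] = datetime.strptime(
        out["load_date"], "%d/%m/%Y").strftime("%Y-%m-%d")  # reformat the date
    return out

row = technical_transform(
    {"name": "  Acme Pty Ltd  ", "country": "au", "load_date": "09/01/2023"})
```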
Task: Define Data Mediation and Quality Processes
Objective: This step provides the logical design of how to handle errors and exception processes, including load errors due to business rules, transformation failures, technical failures or data quality issues. To reduce build complexity, time should be spent on this task trying to generalise it as much as possible to improve reusability.
Regardless of the integration logic required, time must be spent defining the rules for extraction and loading, including specifics around referential integrity or how to handle orphan records. A first attempt should also be made at this stage to generalise some of the load rules to improve reuse; reuse is also revisited in physical design.
There are 5 sub-tasks to this task:
Step: Define Data Validation Processes
Objective: This section defines the data validation rules to verify the correctness of the data. This includes verification of the data format and of the values where they can be checked against a discrete set of valid values.
Simple data cleansing transformations required at the data element level can be included here. Data validation may involve use of an operationalised set of data profiling rules.
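An operationalised rule set of this kind might look like the sketch below: a format check plus a check against a discrete set of valid values. The field names, pattern, and value set are assumptions.

```python
# Minimal validation sketch: verify format (8-digit account number) and
# membership in a discrete valid-value set (status codes). Rules are invented.

import re

VALID_STATUSES = {"OPEN", "CLOSED", "PENDING"}

def validate(row):
    """Return a list of validation errors; an empty list means the row passes."""
    errors = []
    if not re.fullmatch(r"\d{8}", row.get("account_no", "")):
        errors.append("account_no: expected 8 digits")
    if row.get("status") not in VALID_STATUSES:
        errors.append("status: not in valid value set")
    return errors

good = validate({"account_no": "12345678", "status": "OPEN"})
bad = validate({"account_no": "12A45678", "status": "open"})
```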
Step: Define Referential Integrity Processes
Objective: This section defines the required referential integrity rules and the processes required to handle violations. Dealing with referential integrity issues can be one of the most complex areas for integration and the design of this area can have a significant impact in terms of data quality analysis requirements.
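One possible handling policy is sketched below: route orphan child records, whose parent key is missing, to a suspense area rather than loading or discarding them. The policy itself is a design decision and the names are illustrative.

```python
# Referential-integrity sketch: separate loadable child rows from orphans
# (foreign key has no matching parent) so orphans can be held in suspense.

def split_orphans(children, parent_keys, fk="customer_id"):
    loadable = [c for c in children if c[fk] in parent_keys]
    orphans = [c for c in children if c[fk] not in parent_keys]
    return loadable, orphans

parents = {1, 2}                                  # known parent keys
orders = [{"order": "A", "customer_id": 1},
          {"order": "B", "customer_id": 9}]       # 9 has no parent: orphan
loadable, orphans = split_orphans(orders, parents)
```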
Step: Define Use of Data Re-Engineering Processes
Objective: This task defines how Data Re-engineering processes may be operationalised to be used as part of the ETL process. This capability may be provided by a separate technology (e.g. data cleansing tools which perform name and address cleansing and de-duplication) and hence one of the reasons to separate out the functionality as a separate task from mapping rules or business rule definitions. The task should define the logical capabilities that are required from the Data Re-Engineering component; the physical design provides details about integration into the Data Re-Engineering system and the more detailed use of the Data Management Service.
Step: Define Load Dependencies
Objective: Load Dependencies between the different process steps essentially define which tasks are prerequisites to one another, to help provide the overall governing workflow sequence in the automation process. This adds further detail to the overall process flow that was defined earlier. Load Dependencies are often defined at the design level at the target information subject area level, and can be intra-source and/or inter-source.
Load Dependencies can be quite complex when multiple systems are involved, so ideally the overall workflow model for the ETL process flows will be found in a single repository as opposed to across multiple documents.
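Dependencies held in a single repository can be expressed as a graph and resolved into a run order mechanically, for example with a topological sort. The task names below are hypothetical.

```python
# Load dependencies as a graph: each task maps to the set of tasks that must
# complete first. A topological sort yields a valid run order.

from graphlib import TopologicalSorter

dependencies = {
    "load_customer_dim": {"load_staging"},
    "load_product_dim":  {"load_staging"},
    "load_sales_fact":   {"load_customer_dim", "load_product_dim"},
    "load_staging":      set(),
}

run_order = list(TopologicalSorter(dependencies).static_order())
# staging loads first; the fact load runs only after both dimension loads
```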
Step: Define Error Handling and Exception Processes
Objective: This step provides the logical design of how to handle errors and exception processes, including load errors due to business rules, transformation failures, technical failures or data quality issues. This is a very important part of the overall architecture which also needs to be expressed in the ETL design. To reduce build complexity, time should be spent on this task trying to generalise it as much as possible. Some exceptions may explicitly deal with data quality issues whilst other exceptions may be the result of system or load failures. Automatic and/or manual processes which deal with the collection and reporting of issues may be devised to deal with these issues.
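The generalisation goal described above might take the shape of a single reusable wrapper that routes failed rows to a reject store with a reason, rather than aborting the load. This is one possible design, not the methodology's prescribed one.

```python
# Generalised error-handling sketch: any row that fails its transformation
# (business rule, mapping, or quality failure) is captured with a reason.

rejects = []

def load_with_rejects(rows, transform):
    loaded = []
    for row in rows:
        try:
            loaded.append(transform(row))
        except Exception as exc:   # capture the failure instead of aborting
            rejects.append({"row": row, "reason": str(exc)})
    return loaded

def to_amount(row):
    return {"id": row["id"], "amount": float(row["amount"])}

loaded = load_with_rejects(
    [{"id": 1, "amount": "10.50"}, {"id": 2, "amount": "n/a"}], to_amount)
# row 2 fails the float() conversion and lands in the reject store
```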
Task: Define Data Consumer Update Processes
Objective: This task defines the manner in which the consumer will receive new data loads. For relational database sources, it is fairly simple to describe the database tables. For other types of targets, there needs to be a complete description of the targets and/or references to other documents which contain this information.
Step: Determine Consumer Update Techniques
Objective: There are many examples of database update techniques, such as:
Output can also be in the form of sequential files or messaging interfaces.
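One common update technique is an insert-else-update ("upsert") keyed on the business key; others such as truncate-and-load or slowly changing dimensions follow a similar shape. The key and field names below are illustrative.

```python
# Upsert sketch: rows whose business key already exists in the target are
# updated in place; new keys are inserted.

def upsert(target, rows, key="cust_id"):
    index = {r[key]: i for i, r in enumerate(target)}
    for row in rows:
        if row[key] in index:
            target[index[row[key]]] = row          # update existing row
        else:
            target.append(row)                     # insert new row
            index[row[key]] = len(target) - 1
    return target

target = [{"cust_id": 1, "name": "Acme"}]
upsert(target, [{"cust_id": 1, "name": "Acme Ltd"},   # update
                {"cust_id": 2, "name": "Globex"}])    # insert
```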
Step: Determine Consumer Integration Technologies
Objective: Integration Logic applies to the manner of interfacing into the source or target environments. In many data integration environments, this is reasonably straightforward as extracts and load processes will use SQL and will be batch-oriented. For some projects involving integration to application data stores, integration logic may involve adapter integration or an application API. Therefore, this task varies in complexity in relation to the integration points.
Whilst the final technical design is performed in the ETL Physical Design, the information to support the design is collected here.
Task: Define Metadata Integration Process
Objective: This task is used to ensure that metadata is being aligned between systems. Metadata will likely reside in different repositories and may require bi-directional flow of information among the repositories that must be controlled by well defined procedures, not all of which may be automated. This will involve technical metadata for the ETL process and may also include an architectural approach where a centralised metadata repository is used to integrate different aspects of business and technical metadata.