Data Integration Logical Design

Phase 4 - Design Increment

Activity: ETL Logical Design

Objective. The ETL Logical Design provides the key design approach for integrating data from the source to the target environment. It is technology independent and is complemented by the ETL Physical Design, which defines the implementation using a vendor integration technology.

The term ETL is used in a logical sense in this design approach; the tasks below could also be applied to middleware that is more focused on real-time integration.

As ETL technologies advance and continue to move to more model-driven development, more of the tasks below will be done within a software tool. They will also more directly involve interactions with Data Profiling, Data Re-Engineering and Metadata Management tools. Many vendors have already moved in this direction. Many clients, however, are still more comfortable with traditional forms of design documentation (MS Word, Excel), so the team may still need to employ traditional design strategies. Regardless of the approach, the tasks below generally define the key areas of the design process.

The term "Metadata Repository" is used in this document to mean the collection of metadata as it pertains to ETL. Historically, metadata was captured in spreadsheets and word documents. Ideally, design artefacts will would be stored in an metadata repository that is part of an integration suite or a centralised hub.

Many of these tasks can be performed in parallel – they are more representative of the key aspects to be covered in the logical ETL design than a stepwise list of tasks to be performed. Logical design may be broken up into several documents (often by information subject area) or, ideally, held in a centralised design repository.

Some of the tasks related to ETL Logical and Physical Design reference a leading guide on this subject [1] that can be used in a complementary fashion with the MIKE2.0 Methodology.

The ETL Logical Design will often be in Word document form and may be delivered as multiple documents on a per-subject-area basis. The recommended approach is to move away from static documents and towards active metadata repositories for holding design assets.

Major Deliverables
  • Overall Process Flow
  • Source Acquisition Definition
  • Load Dependency Definition
  • Data Quality Process Definition
  • Target Update Definition
  • Metadata Integration Definition



Tasks

Task: Prepare for ETL Design

Objective: This task ensures the prerequisites for logical ETL design are in place: the ETL Conceptual Design/Solution Architecture, the results from Data Profiling and the Data Quality Assessment Report are gathered and reviewed, and the data extracts relevant to the increment are identified.

Input:

  • ETL Conceptual Design/Solution Architecture
  • Results from Data Profiling
  • Data Quality Assessment Report


Output:

  • Ready to begin ETL Design
  • Relevant Data Extracts Defined

Task: Define Overall Process Flow

Objective: The Overall Process Flow provides the high-level view of the workflow process used to integrate data between systems. The visual process flow output of this task gives users a quick description of the flow of information that occurs. For data integration scenarios such as Data Warehouses or Data Migrations this is generally a very simple process flow. The process flow should not attempt to show a single end-to-end flow; it should show a source-to-target flow. In a Data Warehouse that uses an ODS architecture feeding a set of de-normalised Data Marts, for example, the ETL process that loads the ODS and the one that loads the Data Marts would be broken up into two separate process flows.

The Overall Process Flow is ideally built in a software development tool as opposed to a static document.
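
Although the flow itself is normally captured in a design or workflow tool, the sketch below (Python) shows how the two source-to-target flows from the example above might be recorded as simple design metadata; all flow, source and step names are illustrative assumptions rather than part of the methodology.

```python
# Minimal sketch: recording the Overall Process Flow as structured design metadata.
# Flow, source and step names are illustrative assumptions, not prescribed by MIKE2.0.
from dataclasses import dataclass, field


@dataclass
class ProcessFlow:
    name: str
    source: str
    target: str
    steps: list = field(default_factory=list)  # ordered, high-level steps only


# Two separate source-to-target flows, as in the ODS / Data Mart example above.
ods_load = ProcessFlow(
    name="Source systems to ODS",
    source="CRM, Billing",
    target="ODS",
    steps=["extract", "stage", "validate", "transform", "load ODS"],
)

mart_load = ProcessFlow(
    name="ODS to Sales Data Mart",
    source="ODS",
    target="Sales Data Mart",
    steps=["extract from ODS", "aggregate", "load dimensions", "load facts"],
)

for flow in (ods_load, mart_load):
    print(f"{flow.name}: " + " -> ".join(flow.steps))
```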


Input:

  • Detailed Business Requirements for Increment
  • Overall Blueprint Architecture
  • ETL Conceptual Design/Solution Architecture


Output:

  • ETL Logical Design updated with Overall Process Flow

Task: Define Producer Acquisition Processes

Objective: In this task, the logical design is specified in relation to extracting data from source systems. The steps within this task are:

  • Define Staging Area
  • Determine Producer Extract
  • Determine Producer Event
  • Determine Extract Frequencies
  • Define Incremental Data Capture Options
  • Define Producer Integration Technologies


Input:

  • Detailed Business Requirements for Increment
  • Overall Blueprint Architecture
  • ETL Conceptual Design/Solution Architecture


Output:

  • ETL Logical Design updated with Producer Acquisition Definition

Step: Define Staging Area

Objective: The initial steps are taken for defining a staging area, which may be needed for loading data extracts.


Input:

  • Detailed Business Requirements for Increment
  • Overall Blueprint Architecture
  • ETL Conceptual Design/Solution Architecture


Output:

  • ETL Logical Design updated with Staging Area Definition

Step: Determine Producer Extract

Objective: Each source is described, including its technology for information storage and integration. For relational database sources, it is fairly simple to describe the database tables. For other types of sources, there needs to be a complete description of the sources and/or references to other documents which contain this information.


Input:

  • Detailed Business Requirements for Increment
  • Source System Data Elements
  • Overall Blueprint Architecture
  • ETL Conceptual Design/Solution Architecture


Output:

  • ETL Logical Design updated with Producer Acquisition Definition

Step: Determine Producer Event

Objective: This step defines the triggers that initiate extraction of information from source systems. It links directly into the overall process flow. Data acquisition events may be time-based or event-based; in most data integration implementations (e.g. migrations, warehouses) data acquisition is still done in a batch-oriented fashion, although there continues to be a move towards event-oriented integration.

For many data integration projects, data is typically moved into a staging area immediately after acquisition. Loading data into staging avoids the need for source systems to be accessed multiple times, minimises the number of extracts required from data sources, provides a single version of data for multiple uses, and ensures consistency of data to the target environment.
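
To make the distinction between time-based and event-based acquisition concrete, the following sketch records one trigger of each style against hypothetical producers; the producer names, schedules and staging targets are assumptions for illustration only.

```python
# Sketch: capturing acquisition triggers for each producer in the logical design.
# Producer names, schedules, file names and staging targets are illustrative assumptions.
acquisition_events = [
    {
        "producer": "Billing system",
        "trigger_type": "time-based",
        "trigger": "daily at 02:00, after the billing batch completes",
        "lands_in": "staging.billing_extract",
    },
    {
        "producer": "CRM",
        "trigger_type": "event-based",
        "trigger": "arrival of crm_customers_*.csv in the landing directory",
        "lands_in": "staging.crm_customers",
    },
]

for event in acquisition_events:
    print(f"{event['producer']}: {event['trigger_type']} ({event['trigger']}) -> {event['lands_in']}")
```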


Input:

  • Detailed Business Requirements for Increment
  • ETL Conceptual Design/Solution Architecture


Output:

  • ETL Logical Design updated with Event Definition for Acquisition of Data

Step: Determine Extract Frequencies

Objective: As well as the event that initiates the extract, the extract frequencies must be defined. The Business Requirements should state the expected frequency of extracts from each source; using a BusinessTime model may help provide an easy way to illustrate the frequencies of all loads across the environment.


Input:

  • Detailed Business Requirements for Increment
  • ETL Conceptual Design/Solution Architecture


Output:

  • ETL Logical Design updated with Source Extract Frequencies

Step: Define Incremental Data Capture Options

Objective: If incremental updates are required, the options for capturing changed data are investigated and at least a tentative selection is made. The detailed design is performed in the Physical Design phase.
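
One commonly considered capture option is a high-water mark on a last-modified timestamp. The sketch below illustrates only the idea; the table and column names are assumptions, and the choice between timestamps, change tables or log-based capture is settled during Physical Design.

```python
# Sketch: high-water-mark incremental capture using a last-modified column.
# Table and column names (customer, updated_at) are illustrative assumptions.
from datetime import datetime


def build_incremental_query(table: str, ts_column: str, high_water_mark: datetime) -> str:
    """Return the extract query for rows changed since the last successful load."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {ts_column} > '{high_water_mark.isoformat()}'"
    )


last_successful_load = datetime(2024, 1, 31, 2, 0, 0)
print(build_incremental_query("customer", "updated_at", last_successful_load))
# After a successful load, the high-water mark is advanced to the maximum
# updated_at value seen in the extract.
```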


Input:

  • Detailed Business Requirements for Increment
  • ETL Conceptual Design/Solution Architecture


Output:

  • ETL Logical Design updated with Incremental Data Capture Options

Step: Define Producer Integration Technologies

Objective: Integration logic applies to the manner of interfacing into the source or target environments. In many data integration environments, this is reasonably straightforward as extracts and load processes will use SQL and will be batch-oriented. For some legacy systems without a relational data store, SQL is not an integration option; for some projects involving integration to application data stores, integration logic may involve adapter integration or an application API. Therefore, this task varies in complexity in relation to the integration points.

The need to use synchronous or asynchronous interfaces is also defined at this stage.


Input:

  • Detailed Business Requirements for Increment
  • ETL Conceptual Design/Solution Architecture
  • Overall Blueprint Architecture


Output:

  • ETL Logical Design updated with Producer Integration Technologies

Task: Define Data Mapping of Producer to Consumer

Objective: The purpose of this task is to develop a detailed mapping of the producer system data elements to the consumer data element structure. This task involves creating a matrix that identifies the necessary source system elements and maps them to the database elements defined for the increment. Data Profiling tools can help define the structures of the source and target environment and ideally the source and target definitions and transformation rules will be stored in a metadata repository as opposed to a spreadsheet-based matrix.

There are 4 steps to this task:

  • Map Producer Data Elements to Consumer Data Elements
  • Refine Required Producer System Data
  • Define Business Rule Transformations
  • Define Technical Transformations


Input:

  • Detailed Business Requirements for Increment
  • ETL Conceptual Design/Solution Architecture
  • Producer and Consumer Physical Data Models
  • Results from Data Profiling
  • Updated Metadata Repository (from Data Profiling)


Output:

  • ETL Logical Design updated with Data Mapping and Transformation Rules
  • Updated Metadata Repository

Step: Map Producer Data Elements to Consumer Data Elements

Objective: Source data is mapped to target data. Some data may be mappable 1:1. Some producer data elements may need to be combined to create a consumer data element.

In a simple project sources could be loaded directly to targets. In a complex project, there could be multiple mapping segments, for example:

  • Producer to Staging Area
  • Staging Area to Enterprise Data Warehouse
  • Enterprise Data Warehouse to Data Marts

Ideally, the mapping would go into a metadata repository as opposed to a static document such as a spreadsheet.
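
Whether held in a spreadsheet or a metadata repository, the mapping ultimately reduces to a set of records of roughly the shape below; the segment, element and rule values are illustrative assumptions.

```python
# Sketch: producer-to-consumer mapping entries as structured records.
# Element names and transformation rules are illustrative assumptions.
mappings = [
    {
        "segment": "Producer to Staging Area",
        "producer_element": "CRM.CUST.FIRST_NM",
        "consumer_element": "STG_CUSTOMER.FIRST_NAME",
        "rule": "direct move (1:1)",
    },
    {
        "segment": "Staging Area to Enterprise Data Warehouse",
        "producer_element": "STG_CUSTOMER.FIRST_NAME + STG_CUSTOMER.LAST_NAME",
        "consumer_element": "DW_CUSTOMER.FULL_NAME",
        "rule": "concatenate with a single space",
    },
]

for m in mappings:
    print(f"[{m['segment']}] {m['producer_element']} -> {m['consumer_element']} ({m['rule']})")
```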


Input:

  • Catalogue of Producer Data Elements
  • Overall Process Flow
  • Data Models
  • Metadata Repository


Output:

  • ETL Logical Design updated with Mapping Rules Defined
  • Updated Metadata Repository

Step: Refine Required Producer System Data

Objective: Often, only selected source data is required to be loaded into the target environment. It is preferable to filter out the unnecessary data during data extraction from the source. The filtering criteria rules are defined in this step.
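
A filtering rule at the logical level is essentially a predicate recorded against each extract. The small sketch below, using assumed table and column names, shows one way such criteria could be expressed and applied.

```python
# Sketch: filtering criteria applied during (or immediately after) extraction.
# Table names, column names and values are illustrative assumptions.
filter_rules = {
    "customer": "account_status = 'ACTIVE'",       # exclude closed accounts
    "orders": "order_date >= DATE '2020-01-01'",   # history cut-off
}


def apply_filter(table: str) -> str:
    """Return the extract query for a table, adding its filter predicate if one is defined."""
    predicate = filter_rules.get(table)
    base = f"SELECT * FROM {table}"
    return f"{base} WHERE {predicate}" if predicate else base


print(apply_filter("orders"))
```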


Input:

  • Detailed Business Requirements for Increment
  • Results of Data Profiling
  • Metadata Repository


Output:

  • ETL Logical Design updated with Refined Producer Data (required elements only)
  • Updated Metadata Repository

Step: Define Business Rule Transformations

Objective: Business rule transformations describe changes required for business reasons. These are based upon the content and meaning of the data, not just the format. Profiling tools can be valuable for helping to discover business rules.

Business rule transformations can include the following types of processing:

  • Code conversion
  • Calculations and derivations
  • Enhancement with other sources

Business rules go hand in hand with transformation rules in that both are defined within the integration layer and implemented when loading from the source to the target environment. Whereas transformation rules apply more to the mapping of fields, business rules generally refer to procedures that should be followed based on the content of the data. As with transformation rules, these business rules should ideally be stored in a metadata repository.
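
As a brief sketch of two of these transformation types, a code conversion and a calculation/derivation, using assumed code values and an assumed net-revenue rule:

```python
# Sketch: two common business rule transformations.
# The code values and the derivation are illustrative assumptions.

# Code conversion: map producer status codes onto the consumer's standard codes.
STATUS_CODE_MAP = {"A": "ACTIVE", "I": "INACTIVE", "P": "PENDING"}


def convert_status(producer_code: str) -> str:
    return STATUS_CODE_MAP.get(producer_code, "UNKNOWN")


# Calculation/derivation: derive net revenue from gross amount and discount percentage.
def derive_net_revenue(gross_amount: float, discount_pct: float) -> float:
    return round(gross_amount * (1 - discount_pct / 100), 2)


print(convert_status("A"))               # ACTIVE
print(derive_net_revenue(120.0, 10.0))   # 108.0
```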


Input:

  • Producer and Consumer Physical Data Models
  • Results of Data Profiling
  • Detailed Requirements for Increment
  • Metadata Repository


Output:

  • ETL Logical Design updated with Business Rule Transformations
  • Updated Metadata Repository

Step: Define Technical Transformations

Objective: Technical transformations describe any changes required for technical reasons rather than business reasons, such as formatting, data lengths, standardisation, etc. Profiling tools can be valuable for helping to discover data value lists, quality issues, etc.

Technical transformations can include the following types of processing:

  • Null value conversion
  • Data type conversion
  • Separation
  • Concatenation
  • Normalisation or de-normalisation
  • Aggregation
  • Surrogate key generation
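
The sketch below illustrates a few of these technical transformations (surrogate key generation, data type conversion, null value conversion and standardisation) applied to a single incoming record; the default values and key strategy are assumptions for illustration.

```python
# Sketch: a few technical transformations applied to an incoming record.
# Default values and the surrogate key strategy are illustrative assumptions.
import itertools

_surrogate_keys = itertools.count(start=1)


def transform_record(raw: dict) -> dict:
    return {
        "customer_sk": next(_surrogate_keys),                      # surrogate key generation
        "customer_id": int(raw["customer_id"]),                    # data type conversion
        "country": raw.get("country") or "UNKNOWN",                # null value conversion
        "postcode": (raw.get("postcode") or "").strip().upper(),   # standardisation
    }


print(transform_record({"customer_id": "042", "country": None, "postcode": " sw1a 1aa "}))
```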


Input:

  • Metadata Repository
  • Producer Data Models
  • Consumer Data Models


Output:

  • ETL Logical Design updated with Technical Rule Transformations
  • Updated Metadata Repository

Task: Define Data Mediation and Quality Processes

Objective: This step provides the logical design of how to handle errors and exception processes, including load errors due to business rules, transformation failures, technical failures or data quality issues. To reduce build complexity, time should be spent on this task trying to generalise it as much as possible to improve reusability.

Regardless of the integration logic required, time must be spent defining the rules for extraction and loading, including specifics around referential integrity and how to handle orphan records. A first attempt should also be made at this stage to generalise some of the load rules to improve reuse; reuse is revisited again in the physical design.

There are 5 sub-tasks to this task:

  • Define Data Validation Processes
  • Define Referential Integrity Processes
  • Define Use of Data Re-Engineering Processes
  • Define Load Dependencies
  • Define Error Handling and Exception Processes


Input:

  • Detailed Business Requirements for Increment
  • ETL Conceptual Design/Solution Architecture


Output:

  • ETL Logical Design updated with Data Mediation and Quality Process Definitions

Step: Define Data Validation Processes

Objective: This section defines the data validation rules to verify the correctness of the data. This includes verification of the data format and of the values where they can be checked against a discrete set of valid values.

Simple data cleansing transformations required at the data element level can be included here. Data validation may involve the use of an operationalised set of data profiling rules.
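
A minimal sketch of format checks and checks against a discrete set of valid values is shown below; the field names, the date format and the code set are assumed for illustration.

```python
# Sketch: data validation rules for format and discrete value checks.
# Field names, the date format and the valid code set are illustrative assumptions.
import re
from datetime import datetime

VALID_GENDER_CODES = {"M", "F", "U"}


def validate(record: dict) -> list:
    """Return a list of validation errors for a single record (empty if valid)."""
    errors = []
    if not re.fullmatch(r"\d{8}", record.get("customer_id", "")):
        errors.append("customer_id must be an 8-digit string")
    try:
        datetime.strptime(record.get("birth_date", ""), "%Y-%m-%d")
    except ValueError:
        errors.append("birth_date must be a valid date in YYYY-MM-DD format")
    if record.get("gender") not in VALID_GENDER_CODES:
        errors.append("gender must be one of M, F, U")
    return errors


print(validate({"customer_id": "1234567", "birth_date": "1990-02-30", "gender": "X"}))
```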


Input:

  • ETL Conceptual Design/Solution Architecture
  • Detailed Business Requirements for Increment
  • Metadata Repository


Output:

  • ETL Logical Design updated with Defined Data Validation Rules
  • Updated Metadata Repository

Step: Define Referential Integrity Processes

Objective: This section defines the required referential integrity rules and the processes required to handle violations. Dealing with referential integrity issues can be one of the most complex areas for integration and the design of this area can have a significant impact in terms of data quality analysis requirements.
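
One common policy is to divert orphan records to an exception area rather than load them. The sketch below shows that policy in miniature, assuming a simple order-to-customer relationship; the table and key names are illustrative, and other policies (e.g. creating placeholder dimension rows) may be chosen instead.

```python
# Sketch: handling orphan records that violate referential integrity.
# The policy shown (divert orphans, load the rest) is one option among several;
# table and key names are illustrative assumptions.
loaded_customer_keys = {"C001", "C002", "C003"}   # keys already present in the target dimension

incoming_orders = [
    {"order_id": 1, "customer_key": "C001"},
    {"order_id": 2, "customer_key": "C999"},      # orphan: no matching customer
]

valid_rows = [r for r in incoming_orders if r["customer_key"] in loaded_customer_keys]
orphan_rows = [r for r in incoming_orders if r["customer_key"] not in loaded_customer_keys]

print(f"load {len(valid_rows)} rows, divert {len(orphan_rows)} orphan(s) to the exception area")
```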


Input:

  • ETL Conceptual Design/Solution Architecture
  • Detailed Requirements for Increment
  • Metadata Repository
  • Target Data Models


Output:

  • ETL Logical Design updated with Defined Referential Integrity Processes
  • Updated Metadata Repository

Step: Define Use of Data Re-Engineering Processes

Objective: This task defines how Data Re-Engineering processes may be operationalised for use as part of the ETL process. This capability may be provided by a separate technology (e.g. data cleansing tools which perform name and address cleansing and de-duplication), which is one reason to separate this functionality out from mapping rules or business rule definitions. The task should define the logical capabilities that are required from the Data Re-Engineering component; the physical design provides details about integration into the Data Re-Engineering system and the more detailed use of the Data Management Service.


Input:

  • ETL Conceptual Design/Solution Architecture
  • Data Quality Assessment Report
  • Existing Data Re-Engineering Processes
  • Metadata Repository
  • Detailed Business Requirements for Increment


Output:

  • ETL Logical Design updated with Defined Use of Data Re-Engineering Processes
  • Updated Metadata Repository

Step: Define Load Dependencies

Objective: Load Dependencies between the different process steps essentially define which tasks are pre-requisites to one another, providing the overall governing workflow sequence in the automation process. This adds further detail to the overall process flow that was defined earlier. Load Dependencies are often defined at the target information subject area level, and can be intra-source and/or inter-source.

Load Dependencies can be quite complex when multiple systems are involved, so ideally the overall workflow model for the ETL process flows will be found in a single repository as opposed to across multiple documents.
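
Load dependencies form a directed graph from which a valid load sequence can be derived. The sketch below records such a graph and produces one valid ordering; the job names are illustrative assumptions.

```python
# Sketch: load dependencies as a directed graph and a derived load order.
# Job names are illustrative assumptions.
from graphlib import TopologicalSorter   # Python 3.9+

# Each job maps to the set of jobs that must complete before it can start.
dependencies = {
    "load_staging_customers": set(),
    "load_staging_orders": set(),
    "load_dw_customer_dim": {"load_staging_customers"},
    "load_dw_order_fact": {"load_staging_orders", "load_dw_customer_dim"},
}

# One valid order, e.g. staging loads first, then the dimension, then the fact.
print(list(TopologicalSorter(dependencies).static_order()))
```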


Input:

  • ETL Conceptual Design/Solution Architecture
  • Producer to Consumer Process Flow
  • Common Jobs


Output:

  • ETL Logical Design updated with Overall Load Dependencies

Step: Define Error Handling and Exception Processes

Objective: This step provides the logical design of how to handle errors and exception processes, including load errors due to business rules, transformation failures, technical failures or data quality issues. This is a very important part of the overall architecture and also needs to be expressed in the ETL design. To reduce build complexity, time should be spent on this task trying to generalise it as much as possible. Some exceptions may explicitly deal with data quality issues whilst other exceptions may be the result of system or load failures. Automatic and/or manual processes which deal with the collection and reporting of issues may be devised to deal with these issues.
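
Generalising exception handling often amounts to classifying each failure and routing it consistently. The sketch below shows that shape; the categories and routing actions are assumptions chosen for illustration.

```python
# Sketch: a generalised exception-routing function reused across load processes.
# The categories and routing actions are illustrative assumptions.
ROUTING = {
    "data_quality": "write to reject table and continue the load",
    "business_rule": "write to reject table and flag for data steward review",
    "technical": "abort the job and notify operations",
}


def handle_exception(category: str, detail: str) -> str:
    """Classify a failure and report the routing action to take."""
    action = ROUTING.get(category, "abort the job and notify operations")
    print(f"[{category}] {detail} -> {action}")
    return action


handle_exception("data_quality", "order 1042: negative quantity")
handle_exception("technical", "lost connection to staging database")
```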


Input:

  • ETL Conceptual Design/Solution Architecture
  • Producer to Consumer Process Flow
  • Common Jobs


Output:

  • ETL Logical Design updated with Error Handling and Exception Process

Task: Define Data Consumer Update Processes

Objective: This task defines the manner in which the consumer will receive new data loads. For relational database targets, it is fairly simple to describe the database tables. For other types of targets, there needs to be a complete description of the targets and/or references to other documents which contain this information.

Step: Determine Consumer Update Techniques

Objective: There are many examples of database update techniques, such as:

  • Full Refresh
  • Incremental Update
  • Incremental change for a Slowly-Changing Dimension (for Data Warehousing)
  • Transactional Insert

Output can also be in the form of sequential files or messaging interfaces.
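
As a small sketch of the difference between a full refresh and an incremental (upsert) update, using an in-memory dictionary to stand in for the target table; the keys and values are illustrative.

```python
# Sketch: full refresh versus incremental (upsert) update of a target keyed by id.
# The in-memory dict stands in for the target table; all data is illustrative.
target_table = {1: {"name": "Alice", "city": "Leeds"}, 2: {"name": "Bob", "city": "York"}}


def full_refresh(new_load: dict) -> dict:
    """Replace the entire target content with the new load."""
    return dict(new_load)


def incremental_update(current: dict, changes: dict) -> dict:
    """Insert new keys and update changed ones, leaving untouched rows in place."""
    merged = dict(current)
    merged.update(changes)
    return merged


changes = {2: {"name": "Bob", "city": "Hull"}, 3: {"name": "Cara", "city": "Bath"}}
print(incremental_update(target_table, changes))   # row 1 kept, row 2 updated, row 3 inserted
print(full_refresh(changes))                       # only the rows in the new load remain
```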


Input:

  • Detailed Requirements for Increment
  • ETL Conceptual Design/Solution Architecture
  • Metadata Repository


Output:

  • ETL Logical Design updated with Consumer Update Definition

Step: Determine Consumer Integration Technologies

Objective: Integration Logic applies to the manner of interfacing into the source or target environments. In many data integration environments, this is reasonably straightforward as extracts and load processes will use SQL and will be batch-oriented. For some projects involving integration to application data stores, integration logic may involve adapter integration or an application API. Therefore, this task varies in complexity in relation to the integration points.

Whilst the final technical design is performed in the ETL Physical Design, the information to support the design is collected here.


Input:

  • Detailed Business Requirements for Increment
  • ETL Conceptual Design/Solution Architecture


Output:

  • ETL Logical Design updated with Consumer Integration Technologies

Task: Define Metadata Integration Process

Objective: This task is used to ensure that metadata is aligned between systems. Metadata will likely reside in different repositories and may require a bi-directional flow of information among the repositories that must be controlled by well-defined procedures, not all of which may be automated. This will involve technical metadata for the ETL process and may also include an architectural approach where a centralised metadata repository is used to integrate different aspects of business and technical metadata.


Input:

  • Solution Architecture
  • Source and Target Physical Data Models
  • Transformation Rules
  • Business Rules


Output:

  • Metadata Integration Design
  • Updates to Metadata Repository

Roles

  • Infrastructure Architect
  • ETL Design Lead
  • ETL Developers
  • Technical Analysts

Yellow Flags

  • Data Quality issues are not quantitatively understood in source systems
  • Major gaps in defining business rules for data transformation

Potential Changes to this Activity

  • This activity should be renamed Data Integration Logical Design to make it more general in nature. All tasks would stay the same but there would be some minor changes to the text.
  • There should be a task focused on the design of a reconciliation system. This can be a major task but it should at least receive some coverage within this activity.

References

  1. Data Warehouse: From Architecture to Implementation, Barry Devlin (Addison-Wesley Professional, 1996).

Further Reading

  • The Data Warehouse ETL Toolkit, Ralph Kimball and Joe Caserta (Wiley, 2004).