ETL Conceptual Design Deliverable Template
From MIKE2.0 Methodology
This article is a stub. It is currently undergoing major changes as it is in the very early stages of development and is only a placeholder. Please help improve MIKE2.0 by adding to this article.
This deliverable template is used to describe a sample of the MIKE2.0 Methodology (typically at a task level). More templates are now being added to MIKE2.0, as this has been a frequently requested aspect of the methodology. Contributors are strongly encouraged to assist in this effort.
Deliverable templates are illustrative as opposed to fully representative. Please help add examples to this template that are representative of the proposed output.
The purpose of the ETL Conceptual Design task is to guide the overall solution approach that feeds into the ETL Logical and Physical Design. The Conceptual Design surfaces the areas of significant complexity in the early stages of a project.
The ETL Conceptual Design is at a high level, but should contain the following:
- Sketch of the overall flow
- List of sources
- Whether a staging area is to be used
- List of targets
- Major transformations
- Volume estimates
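The flow sketched by the list above can be expressed as a minimal end-to-end pipeline. All source, staging, and target names below are illustrative placeholders, not part of the methodology:

```python
# Conceptual ETL flow sketch: extract -> stage -> transform -> load.
# Sources, the staging area, and the target are in-memory stand-ins.

def extract(source_rows):
    """Pull raw rows from a source system (here: an in-memory list)."""
    return list(source_rows)

def stage(rows):
    """Land raw rows in a staging area, keeping the raw data intact."""
    return [dict(row) for row in rows]

def transform(rows):
    """A major transformation: normalise a name, derive a field."""
    out = []
    for row in rows:
        out.append({
            "customer": row["customer"].strip().title(),
            "amount_eur": round(row["amount_usd"] * 0.92, 2),  # example rate
        })
    return out

def load(rows, target):
    """Write transformed rows into the target table (a list stands in)."""
    target.extend(rows)
    return len(rows)

source = [{"customer": " alice smith ", "amount_usd": 100.0}]
warehouse = []
loaded = load(transform(stage(extract(source))), warehouse)
```

Volume estimates and update frequency then determine how often, and in what increments, this flow is executed.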
Frequency of update (Timing)
Based on the BusinessTime model, it must be determined what kind of data has to be integrated within which period of time without diminishing the value of the decisions based on it. Because this has a significant influence on the conceptual and technical design of the ETL process, this task should be addressed at an early stage. Typical categories of latency are:
- (near) real-time
The demand for (near) real-time data integration is constantly growing. A strict definition of real-time integration would require new data to be integrated within the same transactional context as its recording in an operational system. For analytical systems it is often not necessary to fulfill this requirement in order to achieve a competitive advantage. Nevertheless, relevant data has to be available when it is needed, so the term right-time integration is occasionally used.
The classical batch process for data integration, usually performed at night or on weekends, does not meet (near) real-time requirements.
Several promising concepts are available to meet these requirements:
This group of methods uses data consolidation technologies: data from multiple source systems is transferred into a target system. A typical member of this group is the so-called microbatch. It uses the classical ETL batch approach, but the time between scheduled executions is reduced to minutes or hours. Users of this method benefit from sophisticated batch tools and well-optimized processes. The main disadvantage is the load peaks that can occur on the systems involved when microbatches are processed.
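A minimal sketch of the microbatch idea: the batch job itself is unchanged, only the scheduling interval shrinks. The 15-minute interval and in-memory job are illustrative assumptions:

```python
from datetime import datetime, timedelta

# Microbatch sketch: the classical batch load is reused as-is; only the
# interval between scheduled executions is reduced to minutes.
MICROBATCH_INTERVAL = timedelta(minutes=15)  # illustrative choice

def run_batch(pending_rows, target):
    """The unchanged classical batch load: move all pending rows at once."""
    target.extend(pending_rows)
    moved = len(pending_rows)
    pending_rows.clear()
    return moved

def next_runs(start, n):
    """Compute the next n scheduled microbatch start times."""
    return [start + i * MICROBATCH_INTERVAL for i in range(1, n + 1)]

pending = [{"id": 1}, {"id": 2}]
warehouse = []
moved = run_batch(pending, warehouse)
runs = next_runs(datetime(2024, 1, 1, 0, 0), 3)
```

Each scheduled run still produces a short load peak, which is the disadvantage noted above.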
Continuous concepts are based on data propagation technologies. Using a Middleware_Component, data and messages can be transferred between systems reliably and in a timely manner. As this technical approach was mainly developed for operational systems, very low latencies can be reached. Because data is processed in small chunks rather than in big batches, load peaks are improbable.
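A minimal sketch of continuous propagation, with Python's `queue.Queue` standing in for the middleware component; the change-record shape is an illustrative assumption:

```python
import queue

# Continuous integration sketch: a middleware queue propagates changes
# in small chunks as they occur, so no large batch load peak arises.
middleware = queue.Queue()  # stand-in for a real messaging middleware

def on_source_change(change):
    """Called in the operational system whenever a record changes."""
    middleware.put(change)

def drain_to_target(target):
    """Target side: apply whatever small chunk has arrived so far."""
    applied = 0
    while not middleware.empty():
        target.append(middleware.get())
        applied += 1
    return applied

warehouse = []
on_source_change({"op": "insert", "id": 7})
on_source_change({"op": "update", "id": 7, "field": "status"})
applied = drain_to_target(warehouse)
```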
These methods use technologies based on data federation. A special transformation layer (also called a mediator) presents an integrated, virtual view of the source systems. In contrast to continuous integration, event-driven concepts access source data on demand only. The data is then integrated and made available to the requester. With this approach the presented data is always current; latency depends on the slowest source system that contains the requested data. Enterprise Information Integration is a popular member of event-driven integration.
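A minimal sketch of a federation mediator, assuming two illustrative source systems (`crm` and `billing`): nothing is copied ahead of time, and each request re-reads the sources, so the view is always current.

```python
# Federation sketch: a mediator presents a virtual, integrated view and
# queries the underlying sources only when a request arrives.
# The two dict-based "source systems" are illustrative stand-ins.
crm = {"c1": {"name": "Alice"}}
billing = {"c1": {"balance": 120.0}}

def mediator(customer_id):
    """Integrate source data on demand; latency follows the slowest source."""
    record = {}
    record.update(crm.get(customer_id, {}))
    record.update(billing.get(customer_id, {}))
    return record

view = mediator("c1")
billing["c1"]["balance"] = 95.0   # a source changes...
fresh = mediator("c1")            # ...and the next request reflects it
```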
Mention of significant complexities, such as:
- Slowly-changing dimensions
- Interface with a data re-engineering tool
- Very complex transformations
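Slowly-changing dimensions in particular are worth sketching early. Below is a minimal Type 2 example (a changed attribute closes the current row and appends a new one, preserving history); the column names are chosen for illustration:

```python
# Slowly-changing dimension, Type 2 sketch: instead of overwriting a
# changed attribute, the old row is closed and a new current row added.
def apply_scd2(dimension, key, new_attrs, load_date):
    """Close the current row for `key` if it changed, then append a new one."""
    current = [r for r in dimension if r["key"] == key and r["is_current"]]
    if current and all(current[0].get(k) == v for k, v in new_attrs.items()):
        return dimension  # nothing changed, keep the current row
    for row in current:
        row["is_current"] = False
        row["valid_to"] = load_date
    dimension.append({"key": key, **new_attrs,
                      "valid_from": load_date, "valid_to": None,
                      "is_current": True})
    return dimension

dim = []
apply_scd2(dim, "cust-1", {"city": "Berlin"}, "2024-01-01")
apply_scd2(dim, "cust-1", {"city": "Munich"}, "2024-06-01")
```

After the second load, the Berlin row is closed with `valid_to` set and the Munich row is the single current one.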
It will also be important to cover the overall integration architecture approach and how it will handle different types of scenarios. The ETL Conceptual Architecture should present the high-level strategy related to:
- extraction increments
- full data loads
- data staging
- bulk load
- incremental updates
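A minimal sketch of how a high-water mark can distinguish a full data load from an incremental update; the `updated_at` column is an illustrative assumption:

```python
# Incremental-extraction sketch: a high-water mark decides between a
# full load (no mark yet) and an incremental update of newer rows only.
def extract_increment(source_rows, watermark):
    """Return rows newer than the watermark (all rows if watermark is None)."""
    if watermark is None:
        batch = list(source_rows)             # full data load
    else:
        batch = [r for r in source_rows if r["updated_at"] > watermark]
    new_mark = max((r["updated_at"] for r in batch), default=watermark)
    return batch, new_mark

rows = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
full, mark = extract_increment(rows, None)    # initial full load
rows.append({"id": 3, "updated_at": 30})
delta, mark = extract_increment(rows, mark)   # subsequent incremental update
```

Whether the resulting batch is then applied via bulk load or row-level incremental update is a separate choice the conceptual architecture should also state.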
Where it fits in the SDLC
The Conceptual Design runs in parallel with a number of solution development activities, as shown below.