
Data Integration Physical Design

From MIKE2.0 Methodology


Phase 4 - Design Increment


Activity: ETL Physical Design

Objective. The ETL Physical Design provides an overview of how ETL processes are to be implemented within the technology environment. It covers best practices specific to the vendor product, the detailed automation design and, following on from the logical design, the design around use of common jobs, integration technologies, exception handling, and re-engineering processes. It also covers the capabilities needed to handle other aspects of non-functional design (although most of this will be in the Solution Architecture).

Once again, the term ETL is used in a logical sense in this design approach; the tasks below could equally be applied to middleware that is more focused on real-time integration. The use of GUI-based design tools means that much of this work can be done and documented directly within the tool. At least some aspects of this design will need to be done outside the tool (e.g. standards), and the team should be cautious of letting GUI-driven tools remove the need for the physical design process (which can result in issues such as poor reusability and performance). On the other hand, the team should also be cautious of over-designing within documents like MS Word – there should be an effort to take advantage of the GUI-based tools. The Physical Design is often best done through a combination of written design and prototyping, led through team-based design and alignment with the overall architecture.

As with the ETL Logical Design, many of these tasks can be performed in parallel – they are more representative of the key aspects to be covered in the physical ETL design than a stepwise list of tasks to be performed.

The approach put forward in this ETL Physical Design generally assumes the use of an off-the-shelf ETL product, but it can also be applied to a bespoke solution.

Major Deliverables
  • ETL Physical Design
  • Updated Metadata Repository


Common assets:

  • The MIKE2.0 Data Integration Solution provides a holistic approach to Data Integration that includes design standards that can be applied for physical design
  • Product specific techniques can be referenced as part of the design process
Tasks

Task: Define Overall ETL Architecture Standards and Environment

Objective: These standards are specific to the ETL product being used and the environment in which it will be deployed. There are four steps to this task:

  • Release Management: The ETL version control approach that will be used; including version control within the tool itself.
  • Environments Standards: The ETL environment will be physically deployed in development, testing and production. This will generally be covered in the Solution Architecture.
  • Rollback and Recovery: The strategy for handling load failures. This will include recommendations on whether milestone points and staging will be required for restarts (see the sketch below).
  • ETL Security Standards: The strategy for ETL security, which fits into the overall Security architecture across the environment.

Recommended Standards for this task can be found in the MIKE2 ETL Solution Guide in the section on ETL Architecture Standards and Environment Design.
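Where milestone points are used, a restartable load can be sketched as follows. This is a minimal illustration only: the etl_milestone table, the (name, callable) step structure and SQLite itself are assumptions for the example, not prescribed by the MIKE2 guide.

```python
# Minimal sketch of a restartable load using milestone points. The
# etl_milestone table and all names here are illustrative assumptions.
import sqlite3

def completed_milestones(conn, run_id):
    """Return the milestone names already recorded for this run."""
    rows = conn.execute(
        "SELECT step FROM etl_milestone WHERE run_id = ?", (run_id,))
    return {r[0] for r in rows}

def run_with_milestones(conn, run_id, steps):
    """Execute (name, callable) load steps in order, skipping any that
    already completed, so a failed run can restart where it left off."""
    done = completed_milestones(conn, run_id)
    for name, step in steps:
        if name in done:
            continue  # completed in a previous attempt; do not reprocess
        step()        # e.g. extract to staging, transform, load to target
        conn.execute(
            "INSERT INTO etl_milestone (run_id, step) VALUES (?, ?)",
            (run_id, name))
        conn.commit() # persist the milestone before moving on
```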


Input:

  • ETL Conceptual Design/Solution Architecture
  • Overall SDLC Standards
  • Environment Configuration


Output:

  • ETL Physical Design Updated with ETL Architecture and Design Standards

Task: Define ETL Job Design Standards

Objective: When using an off-the-shelf ETL product, principles for software development do not change: we want our code to be reusable, robust, flexible, and manageable. To assist development, a set of best practices should be created for the implementation to follow. Failure to implement these practices usually results in problems further down the track, such as a higher cost of future development, increased time spent on administration tasks, and problems with reliability.

Listed below are the standards that should be identified on a typical project:

  • Naming Conventions that will be used across the ETL integration environment.
  • Error Handling: proposed standards for error trapping of jobs. This should be at a standards level, with detail of the design covered explicitly in a separate section of the physical design.
  • Process Reporting: status and row counts of jobs should be retrieved for accurate process reporting.
  • Notification: the manner in which information about successful and unsuccessful runs is delivered to the administrator and relevant stakeholders.
  • Parameter Management: the ability to manage job parameters across environments so that components can be delivered and run without requiring any modifications.
  • Optimisation: Standards for improving performance, such as parallelism or hash files. The more detailed design aspects of this approach are covered in a separate section of the physical design.
  • Reusability: Standards around simplified design and use of shared components.
  • Metadata Management: Standards around metadata management as they apply to the ETL design.

The MIKE2 ETL Solution Guide in the section on ETL Job Design Standards provides specific detail required for implementation of best practices in these areas.
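As an illustration of how several of these standards (error trapping, process reporting and notification) might combine in practice, the sketch below wraps an arbitrary job function. All names (run_job, notify) are hypothetical and not tied to any vendor product.

```python
# Illustrative job wrapper applying the standards above: error trapping,
# row-count reporting and notification. Purely a sketch; the notify()
# hook and run_job() signature are assumptions, not a product API.
import logging
import traceback

log = logging.getLogger("etl")

def notify(subject, body):
    """Placeholder notification hook; a real implementation might send
    e-mail or post to an operations dashboard."""
    log.info("%s: %s", subject, body)

def run_job(name, job, params):
    """Run an ETL job function, report row counts, and trap errors."""
    try:
        rows_read, rows_written = job(**params)
    except Exception:
        notify(f"FAILURE {name}", traceback.format_exc())
        raise  # re-raise so the scheduler sees the failure
    notify(f"SUCCESS {name}", f"read={rows_read} written={rows_written}")
    return rows_read, rows_written
```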


Input:

  • ETL Conceptual Design/Solution Architecture
  • Logical ETL Design
  • Specific vendor technology


Output:

  • ETL Physical Design Updated with ETL Best Practices

Task: Design Producer Interfaces

Objective: This task provides the detailed design of the interfaces to the source environments. Building on the logical design in this area, this design task focuses on specifics of the technology and the mechanisms for integration. It is of greatest importance if an interface must be custom-built and accesses an application through messaging or an API.

In this step the capture mechanisms are designed in detail, building on the steps from the logical design. Aspects of this may be done within the ETL tool.

If sources are not relational database tables, the interfaces must be custom-designed to extract the data. This could involve reading flat files, interfacing with a message queuing system or calling an API.

If incremental updates are required, capture options for the acquisition will have been investigated and tentatively selected in Logical Design.

If sources are relational database tables, the interface definition may simply consist of the database connection parameters and the SQL required to extract the data.
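For the relational case, a producer interface definition can be sketched as connection parameters plus extraction SQL, with a watermark column supporting incremental capture. The table and column names are illustrative, and SQLite stands in for the source database.

```python
# Minimal sketch of a relational producer interface: connection
# parameters plus extraction SQL, with a watermark for incremental
# capture. All names are assumptions for illustration.
import sqlite3

SOURCE = {
    "database": "source.db",                 # connection parameters
    "sql": "SELECT id, name, updated_at FROM customer "
           "WHERE updated_at > :watermark",  # extraction SQL
}

def extract(source, watermark="1900-01-01"):
    """Yield source rows changed since the last successful extract."""
    conn = sqlite3.connect(source["database"])
    try:
        yield from conn.execute(source["sql"], {"watermark": watermark})
    finally:
        conn.close()
```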


Input:

  • Producer Acquisition Definition (ETL Logical Design)
  • Metadata Repository
  • ETL Design Standards


Output:

  • ETL Physical Design Updated with Incremental Capture Design
  • ETL Physical Design Updated with Updated Metadata Repository

Task: Design Transformations

Objective: Transformation design comprises a significant part of the ETL development process. This step includes design of metadata transformations and data transformations. In the Logical Design, definitions of the elements of transformation design may have been considered in isolation. In the Physical Design, a unified design with the following transformation elements is sought:

  • Validation and Exception Handling
  • Technical Transformations
  • Business Rule Transformations
  • Data Enhancement (Lookup, Merge, Join, etc.)
  • Integration with Data Re-Engineering Products

Transformation Design includes Metadata Transformation and Data Transformation. Metadata Transformation involves converting the metadata from source definitions to target definitions. This makes it easier to deliver the data to targets or to compare the data to targets and is important for tracing lineage of data from source to targets.

Transformations such as the following can be performed here:

  • Changing the column names to correspond to the target
  • Changing columns from NULL to NOT NULL or vice versa
  • Trimming or expanding column widths
  • Converting strings to dates or vice versa

Data Transformation involves converting data content or comparing source data to target data. It covers a wide range of ETL functions such as lookup, aggregation, merging, joining, splitting, enriching, cleansing, etc.

Metadata Transformation and Data Transformation can be combined into a single stage. If they are separated, it is recommended to perform Metadata Transformation first to minimise ETL errors; some ETL tools fail during data transformation if the metadata transformation has not been completed.
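The sketch below illustrates the separation, with Metadata Transformation (rename, trim width, string-to-date) applied before a lookup-based Data Transformation; all field names are assumptions for illustration.

```python
# Hedged sketch separating Metadata Transformation from Data
# Transformation. Field names (cust_nm, rgn_cd, opn_dt) are invented.
from datetime import date

def metadata_transform(row):
    """Map source field names and types onto the target definition."""
    return {
        "customer_name": row["cust_nm"][:50],            # rename + trim width
        "region_code": row["rgn_cd"],                    # rename only
        "open_date": date.fromisoformat(row["opn_dt"]),  # string -> date
    }

def data_transform(row, region_lookup):
    """Enrich the row with a value looked up from reference data."""
    row["region_name"] = region_lookup.get(row["region_code"], "UNKNOWN")
    return row
```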


Input:

  • Detailed Requirements for Increment
  • ETL Conceptual Design/Solution Architecture
  • Design Data Mediation and Quality Processes (ETL Logical Design)
  • Overall Process Flow (ETL Logical Design)
  • Metadata Repository
  • ETL Design Standards


Output:

  • ETL Physical Design Updated with Validation and Exception Handling Design
  • ETL Physical Design Updated with Technical Transformation Design
  • ETL Physical Design Updated with Business Rule Transformation Design

Task: Design Data Mediation and Quality Processes

Objective: This task has three sub-tasks:

  • Physical Design of Slowly-Changing Dimensions
  • Physical Design for use of Data Re-Engineering Processes
  • Physical Design of Mediation Processes

Step: Physical Design of Slowly-Changing Dimensions

Objective: If the requirements dictate that slowly changing dimensions are part of the solution, the ETL is designed at this step. The data model must contain the elements required, such as surrogate keys, effective and end dates, version numbers, etc. The ETL is designed to generate the values of those elements.
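A minimal sketch of Type 2 slowly-changing-dimension logic, generating surrogate keys, version numbers and effective/end dates; the in-memory structures and names are illustrative, not prescribed by MIKE2.

```python
# Sketch of Type 2 SCD maintenance: on a changed attribute, the current
# version is end-dated and a new version inserted with a fresh surrogate
# key. Column names are assumptions for illustration.
from datetime import date
import itertools

surrogate_keys = itertools.count(1)

def apply_scd2(dimension, natural_key, attributes, today=None):
    """dimension: list of dicts holding all versions of all members."""
    today = today or date.today()
    current = next(
        (r for r in dimension
         if r["natural_key"] == natural_key and r["end_date"] is None),
        None)
    if current and current["attributes"] == attributes:
        return  # no change; keep the current version
    if current:
        current["end_date"] = today           # close off the old version
    dimension.append({
        "surrogate_key": next(surrogate_keys),
        "natural_key": natural_key,
        "attributes": attributes,
        "version": (current["version"] + 1) if current else 1,
        "effective_date": today,
        "end_date": None,                     # open-ended current version
    })
```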


Input:

  • Detailed Business Requirements for Increment
  • ETL Conceptual Design/Solution Architecture
  • Overall Process Flow (ETL Logical Design)
  • Target Update Definition (ETL Logical Design)
  • Metadata Repository


Output:

  • ETL Physical Design Updated with Slowly-Changing Dimension Physical Design

Step: Physical Design for use of Data Re-Engineering Processes

Objective: This task focuses primarily on integration with the Data Re-Engineering systems and also defines in detail the capabilities that will be provided by the Data Management Services invoked from the Data Re-Engineering system. Physical design is important here, as Data Re-Engineering processes can consume significant resources and processing time; performance impacts and the need for synchronous or asynchronous interfaces should also be defined at this stage.
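The synchronous/asynchronous decision can be illustrated as follows; cleanse() is a stand-in for a real Data Re-Engineering service call, and asyncio is used only to show the fan-out pattern.

```python
# Sketch contrasting per-record (synchronous) invocation with
# asynchronous fan-out to a re-engineering service. Illustrative only.
import asyncio

async def cleanse(record):
    """Stand-in for a long-running re-engineering service call."""
    await asyncio.sleep(0.1)  # simulated service latency
    record["name"] = record["name"].strip().title()
    return record

async def cleanse_batch(records):
    """Asynchronous fan-out: overlap service calls rather than waiting
    for each record in turn, which matters at large volumes."""
    return await asyncio.gather(*(cleanse(r) for r in records))

# Example: asyncio.run(cleanse_batch([{"name": "  smith, j  "}]))
```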


Input:

  • Define Use of Data Re-Engineering Processes (ETL Logical Design)
  • Integration Logic (ETL Logical Design)
  • ETL Best Practices


Output:

  • ETL Physical Design Updated with Physical Design for Use of Data Re-Engineering Processes

Step: Physical Design of Mediation Processes

Objective: This task provides the physical design of the mediator process from producer to consumer. In some cases (e.g. simple data migration flow with very little transformation) this design may be quite straightforward. In cases where Data Mastering rules are complex, the mediation requirements may be quite sophisticated.


Input:

  • ETL Logical Design
  • ETL Best Practices
  • ETL Design Standards
  • Metadata Repository


Output:

  • ETL Physical Design Updated with Physical Design of Mediation Processes

Task: Design Consumer Interfaces

Objective: This task provides the detailed design of the interface into the target environment. Building on the logical design in this area, this design task focuses on specifics of the technology and the mechanisms for integration. It is of greatest importance if an interface must be custom-built and accesses an application through messaging or an API.

If targets are relational database tables, the interface definition may simply consist of the database connection parameters and the SQL required to populate the data, along with the database update technique.

If targets are not relational database tables, the interfaces must be custom designed to populate the data. This could be writing flat files, interfacing with a message queuing system or calling an API.
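For the relational case, a consumer interface definition may again reduce to connection parameters, the population SQL and the chosen update technique. A sketch using an upsert, with SQLite syntax and an assumed customer table (unique id column) purely for illustration:

```python
# Minimal sketch of a relational consumer interface: connection
# parameters plus population SQL, with upsert as the update technique.
import sqlite3

TARGET = {
    "database": "warehouse.db",
    "sql": "INSERT INTO customer (id, name) VALUES (?, ?) "
           "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
}

def load(target, rows):
    """Apply (id, name) rows using the configured update technique."""
    conn = sqlite3.connect(target["database"])
    try:
        conn.executemany(target["sql"], rows)
        conn.commit()  # commit the batch as a unit
    finally:
        conn.close()
```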


Input:

  • Data Models
  • Target Update Definition (ETL Logical Design)
  • ETL Logical Design
  • Slowly-Changing Dimension Process Design
  • Metadata Repository
  • ETL Design Standards


Output:

  • ETL Physical Design Updated with Target SQL Interface or Custom Interface Design
  • Updated Metadata Repository

Task: Design Automation and Monitoring Process

Objective: The Physical Automation and Monitoring Design ties together the overall data integration process to provide a mechanism for building and operating all automated steps in the ETL solution. It uses the standards defined earlier to put in place the specific design covering the following:

  • Job Control and Scheduling
  • File Delivery and Preliminary Processing
  • Restarting ETL Processes
  • Tracking of Job Runs (sketched below)
  • E-mail Notifications
  • Job Statistics
  • File Cleanup

It also identifies how these automated steps will be implemented, whether it will be using a GUI-based design tool, a scripting language or another approach.

This task can be done iteratively with certain aspects of physical design (interfaces, re-engineering processes, common jobs, functional design) but generally follows those tasks.
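A minimal sketch of how job runs and statistics might be tracked, assuming a simple etl_job_run table; scheduling itself would typically sit in an external tool. All names are assumptions for illustration.

```python
# Sketch of job-run tracking: record the start of each run, then update
# it with status and row counts on completion.
import sqlite3
import time

def start_run(conn, job_name):
    """Record a new run as RUNNING and return its run id."""
    cur = conn.execute(
        "INSERT INTO etl_job_run (job_name, started_at, status) "
        "VALUES (?, ?, 'RUNNING')", (job_name, time.time()))
    conn.commit()
    return cur.lastrowid

def finish_run(conn, run_id, status, rows_processed):
    """Close off the run with its final status and statistics."""
    conn.execute(
        "UPDATE etl_job_run SET finished_at = ?, status = ?, rows = ? "
        "WHERE id = ?", (time.time(), status, rows_processed, run_id))
    conn.commit()
```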


Input:

  • ETL Logical Design
  • ETL Best Practices


Output:

  • ETL Physical Automation Design

Task: Review ETL Design

Objective: The Physical Design must be reviewed prior to commencement of construction to ensure that the design will be optimal, particularly with respect to reusability, security, reliability and performance.

Reusability: The design must use common jobs where possible or at least common templates for jobs where customisation is required. The common components and the standards for their usage in development are documented.

Security: Security of data must be examined, both from a development and production viewpoint. The design must provide adequate security according to the client’s general standards and any specific standards outlined in the Security Architecture.

Reliability: The design must be examined for ways to increase the reliability of the solution.

Performance Optimisation: The design must be examined in light of performance issues if large volumes of data are to be processed.


Input:

  • All outputs from Physical Design


Output:

  • Updated ETL Physical Design

Role: Infrastructure Architect

Role: ETL Design Lead

Role: ETL Developers

Yellow Flags

  • Operations and Monitoring Design cannot be effectively supported by the operations team
  • Reusable components or templates are not identified
  • Security or reliability risks are identified in the design that do not meet overall business requirements

Potential Changes to this Activity

  • This activity should be renamed Data Integration Physical Design to make it more general in nature. All tasks would stay the same but there will be some minor changes to the text.
  • There should be a task focused on the design of a reconciliation system. This can be a major task, but it should at least receive some coverage within this activity.
  • This activity should be extended to include areas such as optimisation of integration processes and query tuning.