Data Re-Engineering Detailed Process Steps

From MIKE2.0 Methodology


The approach below provides an overview of the Data Re-Engineering process. It aligns with the Data Re-Engineering Activity of the Overall Implementation Guide.

The MIKE2.0 approach for Data Re-Engineering derives some of its ideas and language from the TIQM methodology proposed by Larry English [1]. TIQM provides a comprehensive approach to preventing and resolving Data Quality issues and is recommended as a complementary reference guide for users of MIKE2.0.


Preparing for Data Re-Engineering

Overview

In this activity, the results of the Data Investigation are reviewed, priority data sets are identified and the high-level requirements are confirmed. Data quality issues are typically identified quantitatively as the output of a data profiling exercise.

Key Steps in the Process

[Image: Data re engineering step 0.jpg]
Step 1 – Review Results from Data Investigation Recommendations
Objective: In this task, the results of the Data Quality analysis work are reviewed in preparation for Data Re-engineering.
Input: Completion of Data Investigation
Go-Ahead on Data Re-Engineering project
Process: Review Detailed and Summary Data Quality Report findings
Output: Any issues identified with the Data Quality Report findings
Step 2 - Provide Extract Requirements
Objective: Identification of source extract files that will be moved to the staging area for Data Re-Engineering.
Input: Scope for Data Re-Engineering
Process: Refer to Data Investigation Process on Data Sourcing
Output: Extract Files loaded into staging environment

Data Standardisation

Overview

Data Standardisation brings data into a common format for migration into the target environment. Data Standardisation addresses problems related to:

  • Redundant domain values
  • Formatting problems
  • Non-atomic data from complex fields
  • Embedded meaning in data

The Data Standardisation process is used to get data into an agreed-to atomic form, often mapping data out of complex fields using a vendor tool (a minimal sketch follows). Mapping rules from the standardisation process are ideally fed into a metadata repository.
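
To make this concrete, the following is a minimal sketch of a standardisation routine, assuming a hypothetical source record with a free-text name field, a redundant gender domain and a DD/MM/YYYY date; the field names and code sets are illustrative only and not part of MIKE2.0.

  # Minimal standardisation sketch (illustrative only): breaks a complex
  # free-text name field into atomic parts and maps redundant domain
  # values onto an agreed-to code set. Field names are hypothetical.

  # Redundant domain values mapped to a single standard code
  GENDER_CODES = {"M": "M", "MALE": "M", "F": "F", "FEMALE": "F"}

  def standardise_record(raw: dict) -> dict:
      """Return a record conforming to the agreed-to standardised model."""
      # Non-atomic data: split a complex "SURNAME, GIVEN M" field into atomic parts
      surname, _, given = raw["full_name"].partition(",")
      given_parts = given.split()
      return {
          "surname": surname.strip().title(),
          "given_name": given_parts[0].title() if given_parts else None,
          "middle_initial": given_parts[1][0] if len(given_parts) > 1 else None,
          # Redundant domain values collapsed onto the standard code set
          "gender": GENDER_CODES.get(raw["gender"].strip().upper(), "U"),
          # Formatting problem: DD/MM/YYYY (assumed input format) normalised to ISO 8601
          "date_of_birth": "-".join(reversed(raw["dob"].split("/"))),
      }

  print(standardise_record(
      {"full_name": "SMITH, JOHN A", "gender": "male", "dob": "01/02/1970"}))

In practice a vendor standardisation tool would normally carry out this parsing, with the mapping rules captured as metadata rather than hard-coded.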

Key Deliverables include the following:

  • Design approach around Data Standardisation
  • Role assignments and ownership of standardised model
  • Metadata mappings from source to standardised model
  • Standardised data from source systems into a staging environment

Steps in the Process

[Image: Data re engineering step 1.jpg]


This section provides a process-driven approach for executing the standardisation process. These tasks are run serially; agreement on the set of common data elements may involve several iterations.

Step 1 Identify Common Data Elements for standardisation model
Objective: In this step, team members define the list of common data elements from the source system environment.
Input: Identification of data sources
Process: Data is standardized into a set of common data elements. Key examples of non-standard data include:
* Redundant domain values
* Formatting problems
* Non-atomic data from complex fields
* Embedded meaning in data
In this step the initial set of common data elements is created for review.
Output: Identification of common data elements
Step 2 Ensure ownership and signoff of standardised model
Objective: To define ownership within the project team and ensure there is an ultimate owner of the model that will be developed
Input: Definition of project roles and responsibilities
Process: Key business and technical stakeholders relevant to the data scope are identified.
Team members are assigned stewardship responsibility for the common model. Data stewards act as reviewers of the model and as the bridge between the business and technical teams.
In the Information Development organisation model, these responsibilities are already assigned. These roles would include:
* Information Development Architect
* Information Integration Standards Manager
* Metadata Development Manager
* Data Quality Manager
* Information Repository Manager
Workshops are held to gain final signoff on the model. This process may take multiple iterations to reach consensus. The Information Architect, assisted by the Data Stewards, should be the key facilitator in driving this process.
Output: Initial definition of standardised model
Step 3 Map source to standardised model
Objective: Initial source files are mapped to standardised model
Input: Signoff on standardised model
Process: In this step, the source non-standardised data is mapped to the common model. Mapping rules would include:
* Merge rules for producer to consumer file-entity mapping
* Transformation rules for producer to consumer field mapping
Mapping of complex fields may involve the use of a tool for the standardisation process. For Customer and Address fields in particular, off-the-shelf algorithms may be used for breaking complex fields into atomic units.

The standardisation process should be executed against the model and the new standardised model built.

Ideally, these mapping rules are stored in a metadata repository; for some vendor tools, this will be a by-product of the development process (an illustrative representation is sketched after this step).
Output: Mapping of data into standardised model
Loading into staging area of standardised model
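
As an illustration of how such mapping rules might be represented for loading into a metadata repository, the sketch below uses hypothetical source systems, fields and rule expressions; the structure is an assumption, not a prescribed MIKE2.0 format.

  # Illustrative representation of source-to-standardised mapping rules,
  # suitable for hand-off to a metadata repository load process. The
  # field names and systems are assumptions, not part of MIKE2.0 itself.
  from dataclasses import dataclass, asdict
  import json

  @dataclass
  class MappingRule:
      source_system: str      # producer system
      source_field: str       # producer field
      target_entity: str      # consumer entity in the standardised model
      target_field: str       # consumer field
      rule_type: str          # "merge" or "transformation"
      rule_expression: str    # human/tool-readable mapping logic

  rules = [
      MappingRule("CRM", "CUST_NM", "Customer", "surname",
                  "transformation", "split on ',' and take first token"),
      MappingRule("BILLING", "ACCT_GNDR", "Customer", "gender",
                  "merge", "map {M, MALE} -> M; {F, FEMALE} -> F"),
  ]

  # Serialise for loading into the metadata repository
  print(json.dumps([asdict(r) for r in rules], indent=2))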

Data Correction

Overview

Data Correction typically addresses problems related to:

  • Missing data
  • Values outside expected ranges
  • Value issues related to non-unique fields
  • Temporal or state issues
  • Data that can be referenced against existing reference sets

There are three important principles for this step:

  • Some data quality issues will be obvious and easily correctable
  • Some data quality issues will be correctable but not worth correcting from a cost-perspective
  • Some data quality issues will not be correctable as the information may be lost or the level of detail required has not been captured

Key Deliverables include:

  • Design of correction approach
  • Updates to metadata repository
  • Corrected data

Data Correction is therefore done as an iterative process. The goal should be to quickly address the obvious and easily-correctable issues in the first pass, and then address more complex issues after a cost-benefit analysis.

Steps in the Process

[Image: Data re engineering step 2.jpg]


This section provides a process-driven approach for data correction. There may be multiple iterations required for some tasks, although the process explained below already contains one iterative step for handling obvious vs. complex data issues.

Step 1 Identify obvious data issues
Objective: Identify obvious data issues that are the initial candidates for correction
Input: Data Investigation process
Process: As an output of the profiling process, there should be a number of obvious data issues. The most easily recognizable would include:
* Missing data
* Null values
There are a number of other issues that are typically identified:
* Data out of normal distribution or domain value sets
* Data out of range values
* Duplicate data when unique values are expected
This data should ideally be identified through the use of a profiling tool during the Data Investigation process. Sometimes, a re-investigation of specific record sets will be required even if profiling has already been done.
Although a profiling tool is preferred, Data Investigation can also be done through custom SQL or equivalent scripted checks (a sketch follows this step). Knowledgeable resources will also typically be aware of some data issues.
Output: Obvious data issues are identified to scope initial phase of data re-engineering.
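
Where a profiling tool is not available, the checks for obvious issues can be approximated with simple scripted queries. The sketch below assumes the extract has already been loaded into a pandas DataFrame; the column names, valid ranges and domain sets are hypothetical.

  # Minimal profiling sketch for obvious data issues. All column names,
  # ranges and domain sets are illustrative assumptions.
  import pandas as pd

  df = pd.DataFrame({
      "customer_id": [1, 2, 2, 4],
      "age":         [34, None, 212, 51],
      "gender":      ["M", "F", "X", "F"],
  })

  issues = {
      # Missing data / null values
      "missing_age": int(df["age"].isna().sum()),
      # Data out of range values (nulls are counted separately above)
      "age_out_of_range": int((df["age"].notna() & ~df["age"].between(0, 120)).sum()),
      # Data outside the expected domain value set
      "invalid_gender": int((~df["gender"].isin({"M", "F", "U"})).sum()),
      # Duplicate data where unique values are expected
      "duplicate_ids": int(df.duplicated(subset=["customer_id"]).sum()),
  }
  print(issues)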
Step 2 Design Approach for Obvious Data Issues
Objective: The high level design is defined for resolving obvious data quality issues
Input: Identification of more obvious data issues
Process: The high level design should cover the following:
* Overall approach
* Scope of issues to be addressed
* Manual fixes required
* Correction routines to be used
* Traceability of change (whether it is being fixed in source environment and downstream impacts of change)
The design assets will ideally be stored into a metadata repository. Some tools support this more strongly than others.
Output: Design approach for initial phase of data re-engineering
Step 3 Execute Data Cleansing Process
Objective: The cleansing process is executed for obvious data issues
Input: Design for initial data cleansing process (for scope that is cleansed)
Process: The key aspects of the execution of the Data Cleansing process are:
* Cleansing process is executed, following the "80-20" rule (a sketch of the execution wrapper follows this step)
* Results are reviewed with the team
* Initial source files are archived in case "rollback" is required

Output: Cleansed data for initial phase of data re-engineering
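
A minimal sketch of the execution wrapper is shown below: the initial source file is archived so a rollback is possible, then the agreed correction routines are applied. The file paths and the correct_record routine are hypothetical placeholders, not part of the methodology.

  # Sketch of the execution wrapper only: archive the initial source file
  # so a "rollback" is possible, then apply the agreed correction routines.
  # Paths and the correct_record routine are hypothetical placeholders.
  import csv
  import shutil
  from pathlib import Path

  SOURCE = Path("staging/customers.csv")               # assumed staging extract
  ARCHIVE = Path("archive/customers_pre_cleanse.csv")  # hypothetical archive location

  def correct_record(row: dict) -> dict:
      """Obvious, low-cost fixes only (the 80-20 rule); complex issues are
      deferred to the cost-based prioritisation step."""
      row["gender"] = row.get("gender", "").strip().upper() or "U"
      return row

  ARCHIVE.parent.mkdir(parents=True, exist_ok=True)
  shutil.copy2(SOURCE, ARCHIVE)                        # keep the original for rollback

  with SOURCE.open(newline="") as src:
      cleansed = [correct_record(r) for r in csv.DictReader(src)]

  with Path("staging/customers_cleansed.csv").open("w", newline="") as out:
      writer = csv.DictWriter(out, fieldnames=cleansed[0].keys())
      writer.writeheader()
      writer.writerows(cleansed)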
Step 4 Prioritise Cleansing Based on Costs
Objective: More complex data issues are identified and prioritised based on cost
Input: Completion of initial data cleansing process
Process: Document any data quality issues not resolved in initial pass
Determine other data quality issues that remain
Estimate cost options and solution impacts of the following:
* Attempting to resolve data quality issues
* Not resolving the data quality issues and rejecting the load of this data into the target environment
* Not resolving the data quality issues and loading this data into the target environment (potentially flagging that there is an issue with this data)
These options should be reviewed with the governance team established during the initial stages of the Re-Engineering process.
Output: Cost model for moving forward with data cleansing
Step 5 Design Approach for Complex Data Issues
Objective: For Data Quality issues that have been identified as being cost-effective to resolve, design the process for resolution of these issues
Input: Cost-benefit model for data cleansing to prioritise requirements scope
Process: The high level design should cover the following:
* Overall approach
* Scope of issues to be addressed
* Manual fixes required
* Correction routines to be used
* Traceability of change (whether it is being fixed in source environment and downstream impacts of change)
More complex correction routines may be required than for the more obvious data issues. A tools-based approach is recommended, and the design should refrain from being overly detailed.
The design assets will ideally be stored into a metadata repository. Some tools support this more strongly than others.
Output: Design approach for next phase of data re-engineering
Step 6 Execute Data Cleansing Process
Objective: The cleansing process is executed for complex data issues
Input: Design for complex data cleansing process (for scope that is cleansed)
Process: The key aspects to the execution of the Data Cleansing process are:
* Cleansing process is executed
* Results are reviewed with team
* Initial source files are archived in case "rollback" is required
This step may be iterative, and continually balance the business value of data correction vs. the cost for the overall process.
Output: Cleansed data for this phase of data re-engineering
Step 7 Document results of Data Correction
Objective: Ensure that all Data Correction processes are documented, ideally in a metadata repository.
Input: Completion of data correction process, although this can be done in parallel while data issues are being resolved.
Process: Potential information that should be captured includes (an illustrative audit record follows this step):
* Initial value
* Whether it was corrected
* Corrected Value
* Method of verification of correction
Output: Data Quality overview document
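
An illustrative audit record capturing the items above might look like the sketch below; the field names, verification methods and values are assumptions for illustration.

  # Illustrative correction audit record; all names and values are assumptions.
  from dataclasses import dataclass
  from datetime import date
  from typing import Optional

  @dataclass
  class CorrectionRecord:
      entity: str
      field: str
      initial_value: str
      corrected: bool
      corrected_value: Optional[str]
      verification_method: str      # e.g. "domain value mapping", "manual review"
      corrected_on: date

  log = [
      CorrectionRecord("Customer", "gender", "MALE", True, "M",
                       "domain value mapping", date.today()),
      CorrectionRecord("Customer", "age", "212", False, None,
                       "not correctable - source detail not captured", date.today()),
  ]
  for entry in log:
      print(entry)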

Data Matching and Consolidation

Overview

In this task, data is associated with other records to identify matching sets. Matching records can then either be consolidated to remove duplicates or linked to one another to form new associations.

Key Deliverables for Data Matching include:

  • Overall match criteria with business impact
  • Matched/consolidated data
  • Metadata mapping rules for data matching/consolidation

Steps in the Process

[Image: Data re engineering step 3.jpg]
Step 1 Design Match process
Objective: In this task, the overall match process is designed. The different criteria to be used for matching and weighting are reviewed with the information management team, and it is ensured that the business understands the impacts of matching (including the impacts of over-matching).
Input: Corrected Data
Process: Key Steps in the Process include:
* Determine potential scenarios for matching, based on investigation results and business requirements
* Establish match criteria (this may be provided in a tool). When using a tool, there will oftentimes be pre-defined match criteria modules.
* Weight match criteria (pre-set weights may be provided in a tool). When using a tool, there will oftentimes be pre-configured weights for matching.
* Assess the risks and impacts of incorrect matching. In a team review session, document and discuss the impacts of proposed matches from a business perspective, including over-matching.
* Re-assess the relative weight of each matching attribute. Weight levels are revised based on the impacts of over-matching.
* Determine rules for Data Mastering (if relevant) regarding which data source is the primary authority, a secondary authority, or slave to any data changes
* Finalise the design by defining the inventory of matching techniques to be used for each scenario (a minimal weighted-match sketch follows this step)
Output: Match process high level design
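
A minimal weighted-match sketch is shown below. The attributes, weights and threshold are illustrative assumptions; in practice a matching tool's pre-defined criteria modules and pre-configured weights would normally be the starting point, tuned through the review of over-matching impacts described above.

  # Minimal weighted-match sketch. Attributes, weights and threshold are
  # illustrative assumptions, not prescribed values.
  WEIGHTS = {"surname": 0.4, "date_of_birth": 0.35, "postcode": 0.25}
  MATCH_THRESHOLD = 0.75   # would be tuned on the prototype subset

  def match_score(a: dict, b: dict) -> float:
      """Sum the weights of attributes that agree between two records."""
      return sum(w for attr, w in WEIGHTS.items() if a.get(attr) == b.get(attr))

  rec1 = {"surname": "Smith", "date_of_birth": "1970-02-01", "postcode": "2000"}
  rec2 = {"surname": "Smith", "date_of_birth": "1970-02-01", "postcode": "2010"}

  score = match_score(rec1, rec2)
  print(score, "match" if score >= MATCH_THRESHOLD else "no match")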
Step 2 Build Match Prototype
Objective: An initial prototype is built and tested on a subset of records
Input: Completion of matching high level design
Process: To test the prototype, matching should be done against a subset of data.

Matches should be evaluated to ensure they are in fact duplicates or valid linkages.

Matching results should be driven from relationships within the data ownership model.

The design assets will ideally be stored into a metadata repository. Some tools support this more strongly than others.
Output: A working prototype for matching, that works on at least a subset of data
Step 3 Align Matching Metadata
Objective: Ensure the matching design and output is stored into the metadata repository.
Input: Completion of high level design
Completion of match prototype
Process: As with other aspects of Data Re-Engineering, it should be ensured that matching rules and result sets are stored in a metadata repository.

In this step, the source non-standardised data is mapped to the common model. Mapping rules would include:

* Merge rules for producer to consumer file-entity mapping
* Transformation rules for producer to consumer field mapping
* Cross-reference of any key between systems that would need to be changed

Ideally, these mapping rules are stored in a metadata repository; for some vendor tools, this will be a by-product of the development process (an illustrative key cross-reference is sketched after this step).
Output: If a metadata repository is being used, this information is loaded as part of the design and implementation process
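
As a complement to the merge and transformation rules, the cross-reference of keys between systems might be captured as in the sketch below; the system names, identifiers and JSON hand-off format are assumptions for illustration.

  # Illustrative key cross-reference produced by the match process, so that
  # consolidated records can be traced back to their source-system keys.
  # System names and identifiers are hypothetical.
  import json

  key_xref = [
      {"master_id": "CUST-0001", "source_system": "CRM",     "source_key": "C-1042"},
      {"master_id": "CUST-0001", "source_system": "BILLING", "source_key": "884213"},
      {"master_id": "CUST-0002", "source_system": "CRM",     "source_key": "C-2077"},
  ]

  # Serialised for loading into the metadata repository alongside the
  # merge and transformation rules
  print(json.dumps(key_xref, indent=2))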
Step 4 Execute Overall Match Process
Objective: In this step, the process is executed for the full set of records. Any changed information can be stored as metadata.
Input: Completion of high level design
Completion of match prototype
Process: Key Steps in the process include:
* Matching process is executed
* Results are reviewed with team
* Revisions can be made based on match results and design expectations
* Initial source files are archived in case "rollback" is required
* Final signoff on match process
The implementation assets will ideally be stored into a metadata repository. Some tools support this more strongly than others.
Output: Matched data – this may involve de-duplication or new linkages

Data Enrichment

Overview

Data Enrichment typically refers to supplementing an organisation's internal data with data from external sources. Types of data typically used for enrichment include:

  • Personal and household data
  • External market research data such as product interests
  • Geographical data
  • Micro and macro economic data
  • Postal Data, such as Delivery Point Identifiers (DPID)
  • Demographic information
  • World event information

After data has been standardised, corrected and matched, enriching it is essentially the same as adding any other source data (a minimal join sketch follows).
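
A minimal sketch of such an enrichment load is shown below, joining internal customer data to a hypothetical external demographic feed on a standardised postcode key; the column names and data are illustrative only.

  # Sketch of adding enrichment data once internal data has been
  # standardised, corrected and matched: a straightforward join against an
  # external source on the standardised key. Columns and feed are hypothetical.
  import pandas as pd

  customers = pd.DataFrame({
      "customer_id": [1, 2, 3],
      "postcode":    ["2000", "2010", "3000"],
  })

  # External demographic data keyed on the same standardised postcode
  demographics = pd.DataFrame({
      "postcode":           ["2000", "2010"],
      "median_income_band": ["high", "medium"],
  })

  enriched = customers.merge(demographics, on="postcode", how="left")
  print(enriched)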

Key Deliverables for Data Enrichment include:

  • Design for enrichment
  • Changes to data model and meta-model
  • New data load (refer to ETL solution approach)

Steps in the Process

[Image: Data re engineering step 4.jpg]
Step 1 Determine requirements for Data Enrichment
Objective: Understand the business needs and the proposed benefit provided by external data
Input: Functional requirements that apply to needs for Data Enrichment
Process: Key Steps in the Process include:
* Understand gaps from existing data in mapping to business requirements.
* Define business value provided by supplementary data and how it maps into business requirements.
* Define options for existing data feeds.
* Provide estimated cost-benefit analysis from purchasing additional source data. This should be high-level and presented to the team.
Output: Selected sources for data enrichment
Step 2 Source Extract Definition and Design
Objective: Determine source to be used, how often it will need to be reloaded and whether it may contain any potential data issues.
Input: Sources for data enrichment
Process: Key Steps in the Process include:
* Determine frequency of source data loads
* Determine whether the data load will be the primary set of information in the organisation or whether it needs to be checked against an authoritative source
* Determine method for receiving data (extract load, external connection, etc.)
* Refer to Data Investigation process for definition of source system extracts
Output: Source extract definition and logical design of extract
Step 3 Update Target Model Design
Objective: Update target data model to reflect any requirements for enrichment data
Input: Functional information requirements
Process: The target data model should be defined as part of the initial requirements, but may require some minor changes (especially at the physical level) to accommodate the new data load.

Follow the data modeling process for this area.
Output: Extensions made to target data model
Step 4 Supplement initial data with enriched data
Objective: Load of enrichment data into target environment
Input:
Determination of source extracts
Target Data Model design
Extract Logical Design
Process: Follow the ETL solution process for this area
Output: Enrichment data added to core data

References

  1. Larry English, Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits (John Wiley & Sons, 1999)