From MIKE2 Methodology
|
| This article is currently Under Construction. It is undergoing major changes as it is in the early stages of development. Users should help contribute to this article to get it to the point where is ready for a Peer Review.
|
| This deliverable template is used to describe a sample of the MIKE2.0 Methodology (typically at a task level). More templates are now being added to MIKE2.0 as this has been a frequently requested aspect of the methodology. Contributors are strongly encouraged to assist in this effort.
|
| Deliverable templates are illustrative as opposed to fully representative. Please help add examples to this template that are representative of the proposed output.
|
Overview
Data Correction typically addresses problems related to:
- Missing data
- Value issues due to range issues
- Value issues related non-unique fields
- Temporal or state issues
- Data that can be referenced against existing reference sets
There are 3 important principles for this step:
- Some data quality issues will be obvious and easily correctable
- Some data quality issues will be correctable but not worth correcting from a cost-perspective
- Some data quality issues will not be correctable as the information may be lost or the level of detail required has not been captured
Key Deliverables include:
- Design of correction approach
- Updates to metadata repository
- Corrected data
Data Correction is therefore done as an iterative process. The goal should be to quickly address the obvious and easily-correctable issues in the first pass, and then address more complex issues after a cost-benefit analysis.
Steps in the Process
This section provides a process-driven approach for data correction. There may be multiple iterations required for some tasks, although the process explained below already contains one iterative step for handle obvious vs. complex data issues.
| Step 1 Identify obvious data issues
|
| Objective:
| Identify obvious data issues that are the initial candidates for correction
|
| Input:
| Data Investigation process
|
| Process:
| As an output of the profiling process, there should be a number of obvious data issues. The most easily recognizable would include:
- Missing data
- Null values
There are a number of other issues that are typically identified:
- Data out of normal distribution or domain value sets
- Data out of range values
- Duplicate data when unique values are expected
This data should ideally be identified through the use of a profiling tool during the Data Investigation process. Sometimes, a re-investigation of specific record sets will be required even if profiling has already been done. Although a profiling tool is preferred, Data Investigation can be done through custom SQL. Knowledgeable resources will also typically be aware of some data issues.
|
| Output:
| Obvious data issues are identification to scope initial phase of data re-engineering.
|
| Step 2 Design Approach for Obvious Data Issues
|
| Objective:
| The high level design is defined for resolving obvious data quality issues
|
| Input:
| Identification of more obvious data issues
|
| Process:
| The high level design should cover the following:
- Overall approach
- Scope of issues to be addressed
- Manual fixes required
- Correction routines to be used
- Traceability of change (whether it is being fixed in source environment and downstream impacts of change)
The design assets will ideally be stored into a metadata repository. Some tools support this more strongly than others.
|
| Output:
| Design approach for initial phase of data re-engineering
|
| Step 3 Execute Data Cleansing Process
|
| Objective:
| The cleansing process is executed for obvious data issues
|
| Input:
| Design for initial data cleansing process (for scope that is cleansed)
|
| Process:
| The key aspects to the execution of the Data Cleansing process are:
- Cleansing process is executed, following the 80-20 rule
- Results are reviewed with team
- Initial source files are archived in case "rollback" is required
|
| Output:
| Cleansed data for initial phase of data re-engineering
|
| Step 4 Prioritise Cleansing Based on Costs
|
| Objective:
| More complex data issues are identified and prioritised base on cost
|
| Input:
| Completion of initial data cleansing process
|
| Process:
| Document any data quality issues not resolved in initial pass Determine other data quality issues that remain Estimate cost options and solution impacts of the following:
- Attempting to resolve data quality issues
- Not resolving the data quality issues and rejecting the load of this data into the target environment
- Not resolving the data quality issues and loading this data into the target environment (potentially flagging there is an issues with this data
These options should be reviewed with the governance team established during the initial stages of the Re-Engineering process.
|
| Output:
| Cost model for moving forward with data cleansing
|
| Step 5 Design Approach for Complex Data Issues
|
| Objective:
| For Data Quality issues that have been identified as being cost-effective to resolve, design the process for resolution of these issues
|
| Input:
| Cost-benefit model for data cleansing to prioritise requirements scope
|
| Process:
| The high level design should cover the following:
- Overall approach
- Scope of issues to be addressed
- Manual fixes required
- Correction routines to be used
- Traceability of change (whether it is being fixed in source environment and downstream impacts of change)
More complex correction routines may be required than the more obvious data issues. A tools-based approach is recommended and the design should refrain from being overly detailed. The design assets will ideally be stored into a metadata repository. Some tools support this more strongly than others.
|
| Output:
| Design approach for next phase of data re-engineering
|
| Step 6 Execute Data Cleansing Process
|
| Objective:
| The cleansing process is executed for complex data issues
|
| Input:
| Design for complex data cleansing process (for scope that is cleansed)
|
| Process:
| The key aspects to the execution of the Data Cleansing process are:
- Cleansing process is executed
- Results are reviewed with team
- Initial source files are archived in case "rollback" is required
This step may be iterative, and continually balance the business value of data correction vs. the cost for the overall process.
|
| Output:
| Cleansed data for this phase of data re-engineering
|
| Step 7 Document results of Data Correction
|
| Objective:
| It should be ensured that all Data Correction processes are documented. Ideally this will be done in a metadata repository.
|
| Input:
| Completion of data correction process, although this can be done in parallel while data issues are being resolved.
|
| Process:
| Potential information that should be captured include:
- Initial value
- Whether it was corrected
- Corrected Value
- Method of verification of correction
|
| Output:
| Data Quality overview document
|
Examples