From MIKE2 Methodology
Activity: Data Re-Engineering
Objective
Data Re-Engineering is a term used to describe a number of related functions for standardising data to a common format, correcting data quality issues, removing duplicate information/building linkages between records that did not exist previously, or enriching data with supplementary information.
The MIKE2.0 approach for Data Re-Engineering derives some of its ideas and language from the TIQM Methodology proposed by Larry English [1]. TIQM provides a very comprehensive approach to preventing and resolving Data Quality issues and is recommended as a complementary reference guide for users of MIKE2.0.
Data Re-Engineering is not always conducted within Phase 3 of the MIKE2.0 Methodology; it is an activity that may be conducted throughout the process. In many cases, however, it makes sense to address the major data issues before moving data into the target environment.
Major Deliverables
- Design, metadata assets and actual data changes that apply to the following steps:
- Business Case for addressing overall Data Quality issues
Tasks
Prepare for Data Re-Engineering
Objective:
In this task, the team prepares for Data Re-Engineering by ensuring that at least high-level information requirements have been established, data extracts are available and the software development environment is ready. Data Profiling is typically a prerequisite to this task, as it quantifies the data quality issues that exist beforehand so the team can plan appropriately. Generally, the same process for acquiring extracts can be followed as for Data Investigation.
Input:
- Detailed Business Requirements for Increment
- Data Profiling (typically)
- Configuration Management Baseline
- Development Environment Upgraded
Output:
Perform Data Standardisation
Objective:
Data Standardisation addresses problems related to:
- Redundant domain values
- Formatting problems
- Non-atomic data in complex fields
- Embedded meaning in data
The Data Standardisation process is used to get data into an agreed atomic form, often by mapping in data from complex fields using a vendor tool. Mapping rules from the standardisation process are ideally fed into a metadata repository.
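As an illustration only, the kinds of rules described above can be sketched in a few lines of Python. The field names, the `GENDER_CODES` mapping and the "Surname, Given" layout are assumptions made for this example; they are not part of MIKE2.0 and do not represent any vendor tool.

```python
import re

# Hypothetical mapping rule: redundant domain values collapse to one agreed code.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def standardise_gender(value):
    """Map a free-form gender value to an agreed code, or None if unknown."""
    return GENDER_CODES.get(value.strip().lower())


def standardise_phone(value):
    """Fix a formatting problem by keeping only the digits."""
    return re.sub(r"\D", "", value)


def split_name(full_name):
    """Split a non-atomic 'Surname, Given' field into atomic parts."""
    surname, _, given = full_name.partition(",")
    return {
        "surname": surname.strip().title(),
        "given_name": given.strip().title(),
    }
```

In practice a mapping such as `GENDER_CODES` would be held as a documented rule rather than hard-coded, so that it can be fed into a metadata repository.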
Input:
Output:
Perform Data Correction
Objective:
Data Correction typically addresses problems related to:
- Missing data
- Value issues due to range problems
- Value issues related to non-unique fields
- Temporal or state issues
- Name and address data that can be referenced against existing reference sets
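A minimal sketch of several of the correction types listed above (missing data, range problems, temporal issues and lookup against a reference set) follows. The record layout, the default country, the age bounds and the `REFERENCE_SUBURBS` table are all hypothetical values chosen for illustration; uniqueness checks are omitted because they require the whole data set rather than a single record.

```python
from datetime import date

# Hypothetical reference set mapping postcodes to the expected suburb.
REFERENCE_SUBURBS = {"2000": "Sydney", "3000": "Melbourne"}


def correct_record(record):
    """Return a corrected copy of the record and a list of issues found."""
    fixed = dict(record)
    issues = []

    # Missing data: apply an assumed default value.
    if not fixed.get("country"):
        fixed["country"] = "AU"
        issues.append("country missing, defaulted")

    # Range problem: clear values outside an assumed plausible range.
    age = fixed.get("age")
    if age is not None and not 0 <= age <= 120:
        fixed["age"] = None
        issues.append("age out of range, cleared")

    # Temporal issue: flag for review rather than guess a fix.
    start, end = fixed.get("start_date"), fixed.get("end_date")
    if start and end and end < start:
        issues.append("end_date precedes start_date, needs review")

    # Reference set: overwrite the suburb when the postcode is known.
    suburb = REFERENCE_SUBURBS.get(fixed.get("postcode"))
    if suburb and fixed.get("suburb") != suburb:
        fixed["suburb"] = suburb
        issues.append("suburb corrected from reference set")

    return fixed, issues
```

Returning the list of issues alongside the corrected record keeps a trail of what was changed, which can support later reporting on data quality.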
Input:
Output:
Perform Data Matching and Consolidation
Objective:
In this task, records are compared with one another to identify matching sets. Matching records can then either be consolidated to remove duplicates or linked to one another to form new associations.
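The matching and consolidation steps can be sketched as below. This is an assumption-laden example, not the MIKE2.0 method: it uses the standard-library `difflib.SequenceMatcher` as a simple similarity measure, requires equal postcodes, and uses an arbitrary threshold of 0.85. Dedicated matching tools typically use more sophisticated techniques.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_matches(records, threshold=0.85):
    """Return index pairs with equal postcodes and sufficiently similar names."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a, b = records[i], records[j]
            same_area = a["postcode"] == b["postcode"]
            if same_area and similarity(a["name"], b["name"]) >= threshold:
                pairs.append((i, j))
    return pairs


def consolidate(a, b):
    """Merge two matched records, preferring non-empty values from the first."""
    return {key: a.get(key) or b.get(key) for key in a.keys() | b.keys()}
```

The pairwise comparison is quadratic in the number of records, so on larger extracts records would normally be grouped first (for example by postcode) and compared only within each group.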
Input:
Output:
Perform Data Enrichment
Objective:
Data Enrichment typically refers to supplementing an organisation’s internal data with data from external sources. Types of data typically used for enrichment include:
- Personal data such as date-of-birth and gender codes
- Geographical data
- Postal Data, such as Delivery Point Identifiers (DPID)
- Demographic information
- Economic data
- World event information
This provides a more robust overall set of information on which to base key business decisions.
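As a rough sketch, enrichment can be viewed as a lookup of external attributes by a shared key. The example below assumes postcode is that key and that internal values take precedence when both sources supply the same field; the function name and field names are hypothetical.

```python
def enrich(customers, external_by_postcode):
    """Add external attributes to each customer record, matched on postcode.

    Internal values win when both sources supply the same field.
    Records with no external match are returned unchanged.
    """
    enriched = []
    for customer in customers:
        extra = external_by_postcode.get(customer.get("postcode"), {})
        enriched.append({**extra, **customer})
    return enriched
```

For example, with made-up figures:

```python
customers = [{"id": 1, "postcode": "2000"}, {"id": 2, "postcode": "9999"}]
external = {"2000": {"median_age": 36, "region": "Metro"}}
result = enrich(customers, external)
```

Here the first customer gains `median_age` and `region`, while the second has no external match and is left as it was.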
Input:
Output:
Finalise Business Summary of Data Quality Impact
Objective:
This task provides a summary of the root causes of the data quality issues that impact the business and recommendations for resolving these issues. Recommendations may involve changes to source systems, improvements to business processes, increased validation, etc. The report produced should also include financials, both to build a business case for resolving these data quality issues and to show whether it makes sense to address them from a business perspective.
Input:
- Data Quality Report
- Data Re-Engineering Results
- Project Plan, Project Costs for Data Investigation and Re-Engineering
- Proposed Cost to fix data
Output:
Core Supporting Assets
- Data Investigation and Re-Engineering Solution provides a holistic approach to Data Investigation and Data Re-Engineering and refers to specific process steps for Data Investigation. It also refers to logical and physical best practice.
Yellow Flags
- Root causes of data quality issues are not resolved, so the core problems still exist
- The introduction of automated data quality rules could result in high degrees of load failures
- Re-engineering on one system will result in reconciliation issues with other systems
Key Resource Requirements
Potential Changes to this Activity
A few tasks that are more commonly applied for Search may need to be added to this activity, in particular lemmatisation and spell checking, which are commonly applied through a human interface. Alternatively, the definition of standardisation may be expanded to cover this area.
References
- [1] Larry English, Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits (John Wiley & Sons Inc, 1999)