From MIKE2 Methodology
Data Investigation typically forms one of the key first steps in building capabilities for Information Development, through a quantitative assessment of an organisation's data. It also typically involves an ongoing monitoring process that is put in place once the solution has been implemented.
Data Discovery
Data Discovery (Profiling) should be considered a prerequisite to any significant data integration effort. Data Profiling removes uncertainty about the current information environment: rather than relying on assumptions, it provides a discovery framework for identifying and analysing data quality issues and for making fact-based decisions. This step is required to cost and schedule any data transition or consolidation project accurately.
Use of Data Profiling Tools
Data Profiling often uses a tool-based approach that enables the initial establishment of standards and the initial formulation of metadata. The tool parses and analyses free-form and single-domain fields, determines the number and frequency of unique values, and classifies or assigns a business meaning to each occurrence of a value within a field.
As a result, Data Profiling:
- Uncovers trends, potential anomalies, metadata discrepancies, and undocumented business practices
- Identifies invalid or default values
- Reveals common terminology used in a business area
- Verifies the reliability of fields proposed as matching criteria
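The value-frequency step described above can be illustrated with a minimal Python sketch. It is not part of the MIKE2 method or of any particular profiling tool; the function name `profile_values` and the sample field are hypothetical, chosen only to show how counting distinct values can surface default or placeholder entries such as "N/A".

```python
from collections import Counter

def profile_values(values):
    """Count distinct values and their frequencies for one field."""
    # Trim surrounding whitespace so "NSW " and "NSW" count as one value.
    # None is kept as-is and marks a missing value.
    cleaned = [v.strip() if isinstance(v, str) else v for v in values]
    counts = Counter(cleaned)
    return {
        "total": len(cleaned),
        "distinct": len(counts),
        # Most frequent values first; rare values are anomaly candidates.
        "frequencies": counts.most_common(),
    }

# Hypothetical sample field (illustrative data only)
states = ["NSW", "VIC", "NSW ", "N/A", "VIC", "NSW", None]
report = profile_values(states)
print(report["distinct"], report["frequencies"])
```

In a report like this, low-frequency entries such as "N/A" or a missing value are the ones an analyst would review as possible invalid or default values.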
Data Profiling assesses the completeness of each field, which determines the “usefulness” of the field for matching purposes. Incomplete fields yield lower aggregate match weights for a record, so the record can fail to meet the match cut-off. Investigations are performed on both non-standardised and standardised fields. The purpose of investigating field patterns is either to correct those patterns so that they can be standardised and used for matching, or to isolate them for manual data quality improvement.
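As a rough illustration of completeness and field-pattern investigation, the sketch below computes a populated ratio and a character pattern per value. The conventions used here (letters become `A`, digits become `9`, `None` and empty strings count as missing) are assumptions made for the example, as are the function names and the sample phone numbers.

```python
from collections import Counter

def completeness(values, missing=(None, "")):
    """Fraction of values that are populated (not in `missing`)."""
    if not values:
        return 0.0
    populated = sum(1 for v in values if v not in missing)
    return populated / len(values)

def char_pattern(value):
    """Map letters to 'A' and digits to '9'; keep other characters."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical phone field (illustrative data only)
phones = ["02 9555 0101", "0295550102", "", None, "02-9555-0103"]
ratio = completeness(phones)
# Count how often each pattern occurs among the populated values.
patterns = Counter(char_pattern(p) for p in phones if p)
print(ratio, patterns)
```

Three different patterns for the same kind of value suggest that the field needs standardising before it is used as a matching criterion.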
Information Derived from Data Profiling
Following this approach, a tools-based profiling assessment provides information about data structures and data content:
Data Profiling Discovers Content and Structural Information
| Content Information | Structural Information |
|---|---|
| Attribute Names | Functional dependencies between attributes |
| Data Type | Primary keys in data objects |
| Restrictions on Length | Primary and foreign key pairs |
| Min/Max values | Relationship rules between tables |
| Available values & frequency | Identification of redundant attributes |
| Null rules and convention | Merge attribute pairs between sources |
| Unique rules | Orphan value identification |
| Character patterns | |
| Percent values | |
| Inconsistencies | |
| Identification of business rules | |
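Two of the structural checks listed in the table, primary key discovery and orphan value identification, can be sketched in a few lines of Python. This is an illustrative sketch that assumes tables are held as lists of dictionaries; the table names, column names, and helper functions are all hypothetical.

```python
def is_candidate_key(rows, column):
    """True if the column is fully populated and unique across rows."""
    values = [row.get(column) for row in rows]
    return None not in values and len(set(values)) == len(values)

def orphan_values(child_rows, fk, parent_rows, pk):
    """Foreign-key values in the child that have no matching parent key."""
    parent_keys = {row[pk] for row in parent_rows}
    return sorted({
        row[fk] for row in child_rows
        if row[fk] is not None and row[fk] not in parent_keys
    })

# Hypothetical tables (illustrative data only)
customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 4},  # no customer with id 4
    {"order_id": 12, "customer_id": 2},
    {"order_id": 13, "customer_id": 1},
]

key_ok = is_candidate_key(customers, "id")
fk_is_key = is_candidate_key(orders, "customer_id")
orphans = orphan_values(orders, "customer_id", customers, "id")
print(key_ok, fk_is_key, orphans)
```

A uniqueness check on a sample can only confirm that a column is a candidate key in that sample; whether it is the intended primary key remains a business decision.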
Data Monitoring (Ongoing Data Profiling)
Data Monitoring will occur through the re-use of processes that facilitated the initial data profiling. The monitoring of known data problems will be pro-active and will have two major objectives:
- Recalculate data quality metrics to track the effectiveness of corrections over time and to provide data quality feedback to key stakeholders and users; and
- Identify where known data quality problems are being introduced back into a system, with the aim of pro-actively identifying and addressing areas of increased poor quality occurrence.
The data quality baseline and metrics, broken down by various dimensions, are an important part of ongoing data quality (DQ) monitoring. By scheduling data profiling tasks and comparing their results with the baseline and with previous metrics, it is possible to identify and address known data quality problems that reappear in the system.
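The comparison against a baseline might look like the following sketch. It assumes each metric is a score between 0 and 1 where higher is better, and that a drop larger than a chosen tolerance should be flagged; the metric names, the tolerance value, and the function `compare_to_baseline` are hypothetical.

```python
def compare_to_baseline(baseline, current, tolerance=0.01):
    """Return metrics that fell below baseline by more than `tolerance`.

    The result maps each regressed metric to its change (current - baseline).
    Metrics missing from `current` are skipped rather than flagged.
    """
    regressions = {}
    for metric, base in baseline.items():
        now = current.get(metric)
        if now is not None and base - now > tolerance:
            regressions[metric] = round(now - base, 4)
    return regressions

# Hypothetical metrics from two profiling runs (illustrative data only)
baseline = {
    "customer.phone.completeness": 0.95,
    "customer.postcode.validity": 0.99,
}
current = {
    "customer.phone.completeness": 0.90,
    "customer.postcode.validity": 0.99,
}

regressions = compare_to_baseline(baseline, current)
print(regressions)
```

Running the same comparison after each scheduled profiling task gives stakeholders a simple signal of where a known problem is being reintroduced.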