Data Investigation Component

From MIKE2 Methodology

Data Investigation

Data Investigation typically forms one of the key first steps in building capabilities for Information Development through a quantitative assessment of an organisation's data. It also typically involves an ongoing monitoring process that is put in place once the solution has been implemented.

Data Discovery

Data Discovery (Profiling) should be considered a prerequisite to any significant data integration effort. Data Profiling removes uncertainty about the current information environment: rather than relying on assumptions, it provides a discovery framework for identifying and analysing data quality issues and making fact-based decisions. This step is required to accurately cost and schedule any data transition or consolidation project.

Use of Data Profiling Tools

Data Profiling often uses a tool-based approach that enables the initial establishment of standards and the initial formulation of metadata. It works by parsing and analysing free-form and single-domain fields, determining the number and frequency of unique values, and classifying or assigning a business meaning to each occurrence of a value within a field.
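
As a rough illustration of the value-frequency step described above, the following minimal Python sketch counts the distinct values in one field and how often each occurs. The function name `profile_values` and the sample country-code data are hypothetical and are not part of MIKE2 or of any particular profiling tool.

```python
from collections import Counter

def profile_values(values):
    """Count distinct values and their frequencies for one field."""
    counts = Counter(values)
    return {
        "total": len(values),
        "distinct": len(counts),
        # Most frequent values first; rare values are often anomalies.
        "frequencies": counts.most_common(),
    }

# Hypothetical sample of a single-domain field (country code).
country = ["AU", "AU", "NZ", "au", "", "AU", "N/A"]
profile = profile_values(country)

for value, count in profile["frequencies"]:
    print(repr(value), count)
```

In a listing like this, low-frequency entries such as `au`, the empty string and `N/A` stand out as candidates for invalid or default values, which an analyst would then classify or assign a business meaning to.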

As a result, Data Profiling:

  • Uncovers trends, potential anomalies, metadata discrepancies, and undocumented business practices
  • Identifies invalid or default values
  • Reveals common terminology used in a business area
  • Verifies the reliability of fields proposed as matching criteria

Data Profiling also assesses the completeness of each field, which determines the “usefulness” of the field for matching purposes. Incomplete fields mean that lower aggregate weights will be derived for the record, which may then fail to meet the match cut-off requirements. Investigations are performed on both non-standardised and standardised fields. The purpose of investigating field patterns is either to correct those patterns so that they can be standardised and used for matching, or to isolate them for manual data quality improvement.
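
The completeness and pattern checks described above can be sketched as follows. This is an illustrative example only: the helper names, the set of tokens treated as null, and the convention of mapping letters to `A` and digits to `9` are assumptions chosen for the sketch, not something the methodology prescribes.

```python
from collections import Counter

# Assumption: these tokens are treated as "missing" in this sketch.
NULL_TOKENS = ("", "N/A", "NULL")

def completeness(values):
    """Fraction of values that are populated (not None or a null token)."""
    if not values:
        return 0.0
    populated = sum(
        1 for v in values
        if v is not None and v.strip().upper() not in NULL_TOKENS
    )
    return populated / len(values)

def char_pattern(value):
    """Map letters to 'A' and digits to '9'; keep other characters as-is."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical phone-number field with mixed formats and gaps.
phones = ["02 9555 1234", "0295551234", "", "02-9555-1234", None]

ratio = completeness(phones)
patterns = Counter(char_pattern(v) for v in phones if v)
print(f"completeness: {ratio:.0%}")
for pattern, count in patterns.most_common():
    print(pattern, count)
```

A field with low completeness contributes little to a match weight, and a field with many distinct patterns is a candidate for standardisation before it is used as a matching criterion.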

Information Derived from Data Profiling

Following this approach, a tools-based profiling assessment provides information about data structures and data content:

Data Profiling Discovers Content and Structural Information

Content Information              | Structural Information
---------------------------------+--------------------------------------------
Attribute Names                  | Functional dependencies between attributes
Data Type                        | Primary keys in data objects
Restrictions on Length           | Primary and foreign key pairs
Min/Max values                   | Relationship rules between tables
Available values & frequency     | Identification of redundant attributes
Null rules and convention        | Merge attribute pairs between sources
Unique rules                     | Orphan value identification
Character patterns               |
Percent values                   |
Inconsistencies                  |
Identification of business rules |
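
Two of the structural checks listed above, primary key discovery and orphan value identification, can be sketched in a few lines of Python. The function names and the `customers`/`orders` sample rows are hypothetical; a real profiling tool would run equivalent checks against the source tables.

```python
def is_candidate_key(rows, column):
    """True if the column has no nulls and no duplicate values."""
    values = [row[column] for row in rows]
    return None not in values and len(set(values)) == len(values)

def orphan_values(child_rows, fk, parent_rows, pk):
    """Foreign key values in the child that have no matching parent key."""
    parent_keys = {row[pk] for row in parent_rows}
    return sorted(
        {row[fk] for row in child_rows
         if row[fk] is not None and row[fk] not in parent_keys}
    )

# Hypothetical parent and child tables.
customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 4},
    {"order_id": 12, "customer_id": 2},
    {"order_id": 13, "customer_id": 4},
]

print(is_candidate_key(customers, "id"))
print(is_candidate_key(orders, "customer_id"))
print(orphan_values(orders, "customer_id", customers, "id"))
```

Here `customers.id` qualifies as a candidate key, `orders.customer_id` does not because it repeats, and customer `4` is reported as an orphan value because no parent row carries that key.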

Data Monitoring (Ongoing Data Profiling)

Data Monitoring will occur through the re-use of processes that facilitated the initial data profiling. The monitoring of known data problems will be pro-active and will have two major objectives:

  • Recalculate data quality metrics to track the effectiveness of corrections over time and to provide data quality feedback to key stakeholders and users; and
  • Identify where known data quality problems are being introduced back into a system, with the aim of pro-actively identifying and addressing areas of increased poor quality occurrence.

The data quality baseline and metrics, broken down by various dimensions, are an important part of ongoing DQ monitoring. By scheduling data profiling tasks and comparing their results to the baseline and to previous metrics, it is possible to identify and address known data quality problems that may reappear in the system.
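
The comparison against a baseline can be illustrated with a short Python sketch. The metric names, the scores and the tolerance are invented for the example, and the sketch assumes that every metric is a score between 0 and 1 where higher is better.

```python
def regressions(baseline, current, tolerance=0.01):
    """Return metrics that fell below the baseline by more than tolerance.

    Assumes higher scores are better for every metric.
    """
    flagged = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is not None and base - now > tolerance:
            flagged[name] = (base, now)
    return flagged

# Hypothetical scores from the initial profiling run and a scheduled re-run.
baseline = {"completeness": 0.98, "validity": 0.95, "uniqueness": 1.00}
current = {"completeness": 0.975, "validity": 0.90, "uniqueness": 1.00}

flagged = regressions(baseline, current)
for name, (base, now) in flagged.items():
    print(f"{name}: baseline {base:.3f}, current {now:.3f}")
```

Running such a comparison after each scheduled profiling task gives stakeholders a simple signal of where a known problem is being reintroduced; in this sample only `validity` is flagged, because the small drop in `completeness` stays within the tolerance.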
