From MIKE2 Methodology
Activity: Data Profiling
Objective
Data Profiling focuses on conducting an assessment of actual data and data structures. It helps provide the following:
- Identifies data quality issues - measurements are taken against a number of dimensions, to help identify issues at the individual attribute level, at the table-level and between tables.
- Captures metadata as a by-product of the process – Metadata is key to the success of subsequent phases of the data load including the ongoing use and maintenance of the data. Useful metadata is identified as part of the data profiling exercise and the tool-based approach will ensure it is captured in a reusable format.
- Identifies business rules – The next step is to perform the data mapping. Data profiling will assist in gaining an understanding of the data held in the system and in identifying business rules for handling the data. This will feed into the future data mapping exercise.
- Assesses the ’fitness for purpose’ of the source system data to satisfy the business requirements. This assessment is often done in the context of a Data Warehouse, CRM, or ERP system initiative. Therefore, the focus is on gaining a very detailed understanding of the source data that will feed these target systems, to ensure that the quality level is sufficient to meet the requirements of the target system.
The purpose of this phase is to provide objective, information-based results around information analysis.
Major Deliverables
- Data Quality Assessment Report (per Source System)
- Data Quality Metrics updated to Metadata Repository
- Mapping Rules and Business Rules updated to Metadata Repository
Tasks
Prepare for Assessment
Objective:
In this task, it ensured that the profiling environment is ready and the scope of information to be investigated is agreed-upon within the team and signed off by the client. This may be challenging as at this stage some of the other requirements may still be somewhat undefined. Minimising any gaps in time in getting the necessary extract files is a critical area dependency and risk area during this task. As profiling requires production extracts, the timelines for procurement of data may be significant.
Input:
- Detailed Business Requirements for Increment
- Upgraded Development Environment
- Configuration Management Baseline
- Metric Categories and Measurement Techniques
Output:
Perform Column Profiling
Objective:
This task involves profiling the data found in a single column/field in either a table or a flat file. It involves analysis of simple and complex fields. Each task is done on a per-system (or subset of a system) basis.
Key steps to column profiling include:
- Understand all the fields and document their descriptions in the profiling tool
- Now that fields are understood, analyse the fields that have been identified as being of interest during the interviewing-based assessment
- Update relevant sections of the Data Quality Assessment Report with findings
- Determine additional tables which may be required to provide the desired data
Simple and Complex field profiling may be split into separate tasks.
Input:
- Information Requirements for column-level data analysis
- Relevant data extracts
Output:
Perform Table Profiling
Objective:
This task involves analysing data across rows of a single table to establish dependencies between attributes within each table. Each task is done on a per-system (or subset of a system) basis.
Key steps to table profiling include:
- Run table and primary key analysis on the selected entities using the profiling tool
- Review results with the Source System SME
- Perform further analysis on exceptions and anomalies, eg broken primary keys
- Update relevant sections of the Data Quality Assessment Report with findings
On completion, further analysis may be planned based on results.
Input:
- Completion of Column Profiling
- Information Requirements for table-level data analysis
- Relevant data extracts
Output:
Perform Multi-Table Profiling
Objective:
This task involves analysing data across tables to look for redundancy and referential integrity issues. Tasks may done on a per-systems basis, sub-system basis or across systems. Key steps to table profiling include:
- Using the profiling tool, perform relationship analysis on the selected tables to confirm uniqueness, referential integrity and to identify redundant data across tables
- Update relevant sections of the Data Quality Assessment Report with findings
- Discuss results with Source System SME
- Plan further analysis based on results
Input:
- Completion of Table Profiling
- Information Requirements for multi-table level data analysis
- Relevant data extracts
Output:
Finalise Data Quality Report
Objective:
This step will complete and issue for signoff the Data Quality Assessment Report. Sections of the Data Quality Assessment Report and metadata repository should be populated throughout the End-to-End profiling exercise. This step is to complete remaining sections and to make a final recommendation on whether this data should be loaded into the target system. On completion, there should be a formal walkthrough, review and final signoff.
Input:
- Completion of Column Profiling
- Completion of Table Profiling
- Completion of Multi-Table Profiling
Output:
Core Supporting Assets
Yellow Flags
- Profiling team not provided with representative production data
- Profiling result-set doesn't directly flow into the following activities:
- ETL Integration
- Metadata related to data quality
- Data Re-Engineering approach
Key Resource Requirements