From MIKE2 Methodology
|
| This article is currently Under Construction. It is undergoing major changes as it is in the early stages of development. Users should help contribute to this article to get it to the point where is ready for a Peer Review.
|
| This deliverable template is used to describe a sample of the MIKE2.0 Methodology (typically at a task level). More templates are now being added to MIKE2.0 as this has been a frequently requested aspect of the methodology. Contributors are strongly encouraged to assist in this effort.
|
| Deliverable templates are illustrative as opposed to fully representative. Please help add examples to this template that are representative of the proposed output.
|
The Perform Multi-Table Profiling task involves analysing data across tables to look for redundancy and referential integrity issues. Tasks may done on a per-systems basis, sub-system basis or across systems. Key steps to table profiling include:
Using the profiling tool, perform relationship analysis on the selected tables to confirm uniqueness, referential integrity and to identify redundant data across tables
- Update relevant sections of the Data Quality Assessment Report with findings
- Discuss results with Source System SME
- Plan further analysis based on results
Multi-Table Profiling
Tools-Based Assessment (Data Profiling) Execution
Whereas the IM QuickScan-based assessment is used to uncover the insight and perceptions of key staff that are closest to and most affected by information quality issues, the tools-based assessment phase focuses on conducting an assessment of actual the data and data structures. The purpose of this phase is to complement the survey-based assessment feedback with objective, information-based results. It also provides a quality check against the results of the interviewing process.
Objectives of Data Profiling
- Identify data quality issues - measurements are taken against a number of dimensions, to help identify issues at the individual attribute level, at the level table and between tables.
- Capture metadata as a by-product of the process – Metadata is key to the success of subsequent phases of the data load including the ongoing use and maintenance of the data. Useful metadata is identified as part of the data profiling exercise and the tool-based approach will ensure it is captured in a reusable format.
- Identify business rules – The next step is to perform the data mapping. Data profiling will assist in gaining an understanding of the data held in the system and in identifying business rules for handling the data. This will feed into the data mapping exercise.
- To assess the ’fitness for purpose’ of the source system data to satisfy the business requirements. This assessment is often done in the context of either a Data Warehouse, CRM, or ERP system initiative. Therefore, the focus is on gaining a very detailed understanding of the source data that will feed these target systems, to ensure that the quality level is sufficient to meet the requirements of the target system.
This section is meant to provide guidelines as to the tests that are commonly used and the approach taken to measure data quality. Each project should consider these to be a starting point only and should ensure that the specific tests that will uncover greatest value for a client are uncovered as part of the discovery process for this client.
Overview
Data sources are profiled in multiple dimensions: down columns (column profiling); across rows (Table Profiling); and across tables (Multi-Table Profiling). The output from Data Profiling can then be used in a number of fashions, including data cleansing and mapping to a target environment.
Multi-Table Profiling
Multi-Table Analysis looks for relationships among columns existing in different tables, rather than just within a single table. Multi-Table profiling shows the relationships between tables using key fields as links to bridge the multiple tables. The aim is to analyse and determine the existence of referential integrity issues between each table analysed, e.g. orphans and clients without address and contact information. If the profiling analysis reveals two columns in different tables share the same set of values and one is not a foreign key and the other a primary key, than redundant data exists between the two tables. For example, an employee table and an employee demographics table might both contain a "home phone number" column and the same values within these columns. Thus, an employee
’ s home phone number might be stored redundantly in two separate tables.
Typical Investigations for Multi-Table Profiling
Multi-Table Profiling is conducted to verify the validity/uniqueness and accuracy of the key fields to ensure referential integrity. An example would be to assess the Customer Table vs. a Customer Loan Table
Customer Table:
| Field
| Test
| Description
|
| Customer ID
| Completeness
| Is a unique value populated within this field to uniquely identify each customer record? If so is there a high population ratio on this field, ie: 100% populated.
|
|
| Uniqueness
| Presence of a Primary Key or Foreign Key in relational databases. Are these keys unique to each record?
|
|
| Validity
| Are these key fields populated with valid value, ie: unique values?
|
|
| Format
| Are these values in a consistent format, ie: numeric or does it contain leading alphas etc.
|
Customer Loan Table:
| Field
| Test
| Description
|
| Customer ID
| Completeness
| Is a unique value populated within this field to uniquely identify each customer record within the Customer table? If so is there a high population ratio on this field, ie: 100% populated.
|
|
| Uniqueness
| Presence of a Primary Key or Foreign Key in relational databases. Are these keys unique to each record?
|
|
| Validity
| Are these key fields populated with valid value, ie: unique values?
|
|
| Format
| Are these values in a consistent format, ie: numeric or does it contain leading alphas etc.
|
Some vendors have multiple profiling products that are used to fulfil these functions. The attached documents provide further detail on vendor capabilities.
The Iterative Profiling Process
It is important to note that there is an iterative approach to profiling within each of the analysis steps below. Each step involves:
- Running the analysis within the appropriate Vendor Tool
- Analysing the results of each analysis
- Verify the results with the Source System SME
- Documenting the results (both in deliverables and within the profiling tools)
- Plan further analysis based on results
Each of these sub-steps are discussed in more detail in the task routes below.
Figure 1: Profiling is an Iterative Process
The data investigation process should verify what the source system owners say about the data against the actual data; it should also verify what the data says against what the source system owners say. This process is iterative – for example, a source system owner may say that all customer names must have a full first name and full surname. However, when this rule is checked against the data, this shows that 10% of the records have only a first initial. In this case, this must be discussed with the source system owner. This type of anomaly may be explained by a business rule that was applied to new data that was not applied to historical data. Further analysis is performed in this case to verify that all anomalous records were created before the expected data.
Steps in the Process
This section provides a process-driven approach for the executing Data Profiling. It should be noted that this process if often iterative and may start with a preliminary profiling of key systems before doing end-to-end profiling.
| Step 3 Multi-Table Profiling
|
| Objective:
| This task involves analysing data across tables to look for redundancy and referential integrity issues. Tasks may done on a per-systems basis, sub-system basis or across systems.
|
| Input:
| Table Profiling
|
| Process:
| Using the profiling tool, perform relationship analysis on the selected tables to confirm uniqueness, referential integrity and to identify redundant data across tables
Update relevant sections of the Data Quality Reportwith findings Discuss results with Source System SME Plan further analysis based on results
|
| Output:
| Redundancy Analysis will identify:
- Potential relationships with fields in other tables
- Redundant data between tables
- Potential referential integrity issues eg. Identification of orphans records
Deliverables include:
- Completion of the relevant sections of the XXX Data Quality Assessment.doc
|
Examples