Perform Multi-Table Profiling Deliverable Template

This article is currently Under Construction. It is undergoing major changes as it is in the early stages of development. Users should help contribute to this article to get it to the point where it is ready for a Peer Review.
This deliverable template is used to describe a sample of the MIKE2.0 Methodology (typically at a task level). More templates are now being added to MIKE2.0 as this has been a frequently requested aspect of the methodology. Contributors are strongly encouraged to assist in this effort.
Deliverable templates are illustrative as opposed to fully representative. Please help add examples to this template that are representative of the proposed output.

The Perform Multi-Table Profiling task involves analysing data across tables to look for redundancy and referential integrity issues. The task may be done on a per-system basis, a sub-system basis, or across systems. Key steps in multi-table profiling include:

  • Using the profiling tool, performing relationship analysis on the selected tables to confirm uniqueness and referential integrity and to identify redundant data across tables (see the sketch after this list)
  • Updating relevant sections of the Data Quality Assessment Report with findings
  • Discussing results with the Source System SME
  • Planning further analysis based on results
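
As an illustration of the relationship-analysis step, the sketch below uses Python with pandas to flag orphaned records in a child table, i.e. key values with no match in the parent table. The table and column names are hypothetical; a dedicated profiling tool automates this kind of check, but the underlying logic is the same.

    import pandas as pd

    # Hypothetical extracts of a parent and child table; a real profiling
    # run would read these from the source system.
    customers = pd.DataFrame({"customer_id": [1, 2, 3]})
    loans = pd.DataFrame({"loan_id": [10, 11, 12],
                          "customer_id": [1, 2, 4]})

    # Left-join the child table to the parent and keep the unmatched rows:
    # these are orphans, i.e. referential integrity violations.
    joined = loans.merge(customers, on="customer_id",
                         how="left", indicator=True)
    orphans = joined[joined["_merge"] == "left_only"]

    print(f"{len(orphans)} orphaned loan record(s):")
    print(orphans[["loan_id", "customer_id"]])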


Multi-Table Profiling

Tools-Based Assessment (Data Profiling) Execution

Whereas the IM QuickScan-based assessment is used to uncover the insights and perceptions of the key staff who are closest to and most affected by information quality issues, the tools-based assessment phase focuses on conducting an assessment of the actual data and data structures. The purpose of this phase is to complement the survey-based assessment feedback with objective, information-based results. It also provides a quality check against the results of the interviewing process.



Objectives of Data Profiling

  • Identify data quality issues – Measurements are taken against a number of dimensions to help identify issues at the individual attribute level, at the table level, and between tables.
  • Capture metadata as a by-product of the process – Metadata is key to the success of subsequent phases of the data load, including the ongoing use and maintenance of the data. Useful metadata is identified as part of the data profiling exercise, and the tool-based approach ensures it is captured in a reusable format.
  • Identify business rules – Data profiling will assist in gaining an understanding of the data held in the system and in identifying business rules for handling the data. This will feed into the subsequent data mapping exercise.
  • Assess the 'fitness for purpose' of the source system data to satisfy the business requirements – This assessment is often done in the context of a Data Warehouse, CRM, or ERP initiative. The focus is therefore on gaining a very detailed understanding of the source data that will feed these target systems, to ensure that the quality level is sufficient to meet the requirements of the target system.

This section is meant to provide guidelines as to the tests that are commonly used and the approach taken to measure data quality. Each project should consider these a starting point only, and should ensure that the specific tests that will uncover the greatest value for a client are identified as part of the discovery process for that client.

Overview

Data sources are profiled in multiple dimensions: down columns (Column Profiling); across rows (Table Profiling); and across tables (Multi-Table Profiling). The output from Data Profiling can then be used in a number of ways, including data cleansing and mapping to a target environment.

Multi-Table Profiling

Multi-Table Analysis looks for relationships among columns existing in different tables, rather than just within a single table. Multi-Table Profiling shows the relationships between tables using key fields as links to bridge the multiple tables. The aim is to analyse and determine the existence of referential integrity issues between the tables analysed, e.g. orphaned records, or clients without address and contact information. If the profiling analysis reveals that two columns in different tables share the same set of values, and the pair is not a primary key/foreign key relationship, then redundant data exists between the two tables. For example, an employee table and an employee demographics table might both contain a "home phone number" column with the same values; in that case, an employee's home phone number is stored redundantly in two separate tables.
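
The following Python sketch (hypothetical data and names, using pandas) shows one way the redundancy check described above might be performed outside a profiling tool: compare the value sets of column pairs across tables and flag matches that are not a known primary key/foreign key pair.

    import pandas as pd

    # Hypothetical extracts; in practice these come from the source system.
    employees = pd.DataFrame({
        "emp_id": [1, 2, 3],
        "home_phone": ["555-0101", "555-0102", "555-0103"],
    })
    demographics = pd.DataFrame({
        "emp_id": [1, 2, 3],
        "home_phone": ["555-0101", "555-0102", "555-0103"],
    })
    tables = {"employees": employees, "demographics": demographics}

    # Known primary key/foreign key pairs are excluded: a shared value set
    # there is expected, not redundant.
    key_pairs = {frozenset({("employees", "emp_id"),
                            ("demographics", "emp_id")})}

    for name_a, df_a in tables.items():
        for name_b, df_b in tables.items():
            if name_a >= name_b:
                continue  # examine each pair of tables once
            for col_a in df_a.columns:
                for col_b in df_b.columns:
                    if frozenset({(name_a, col_a), (name_b, col_b)}) in key_pairs:
                        continue
                    if set(df_a[col_a].dropna()) == set(df_b[col_b].dropna()):
                        print(f"Candidate redundancy: {name_a}.{col_a} and "
                              f"{name_b}.{col_b} share the same value set")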

Typical Investigations for Multi-Table Profiling

Multi-Table Profiling is conducted to verify the validity, uniqueness, and accuracy of the key fields and so ensure referential integrity. An example would be to assess the Customer Table against a Customer Loan Table:

Customer Table:

Field: Customer ID
  • Completeness – Is a value populated within this field to uniquely identify each customer record? If so, is there a high population ratio on this field (i.e. 100% populated)?
  • Uniqueness – Is a primary key or foreign key present, and are these keys unique to each record?
  • Validity – Are these key fields populated with valid (i.e. unique) values?
  • Format – Are these values in a consistent format (e.g. purely numeric), or do they contain leading alphas, etc.?

Customer Loan Table:

Field: Customer ID
  • Completeness – Is a value populated within this field that identifies the corresponding customer record within the Customer Table? If so, is there a high population ratio on this field (i.e. 100% populated)?
  • Uniqueness – Is a primary key or foreign key present, and are these keys unique to each record?
  • Validity – Are these key fields populated with valid (i.e. unique) values?
  • Format – Are these values in a consistent format (e.g. purely numeric), or do they contain leading alphas, etc.?
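
As a rough illustration of how the tests above might be expressed in code, the following Python sketch (using pandas; the data, the field name, and the purely-numeric format expectation are assumptions) computes the completeness ratio and flags duplicate and badly formatted key values.

    import pandas as pd

    # Hypothetical Customer Table extract; "A105" and the missing value are
    # deliberately bad records.
    customer = pd.DataFrame({"customer_id": ["1001", "1002", "1002", None, "A105"]})
    key = customer["customer_id"]
    populated = key.dropna()

    # Completeness: population ratio of the key field (target: 100%).
    print(f"Completeness: {key.notna().mean():.0%} populated")

    # Uniqueness/validity: each populated key value should occur exactly once.
    print("Duplicate key values:", list(populated[populated.duplicated()].unique()))

    # Format: expected purely numeric; flag leading alphas and the like.
    print("Badly formatted values:", list(populated[~populated.str.fullmatch(r"\d+")]))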

Some vendors have multiple profiling products that are used to fulfil these functions. The attached documents provide further detail on vendor capabilities.

The Iterative Profiling Process

It is important to note that profiling takes an iterative approach within each of the analysis steps below. Each step involves:

  • Running the analysis within the appropriate vendor tool
  • Analysing the results
  • Verifying the results with the Source System SME
  • Documenting the results (both in deliverables and within the profiling tools)
  • Planning further analysis based on results

Each of these sub-steps is discussed in more detail in the task routes below.

Figure 1: Profiling is an Iterative Process

The data investigation process should verify what the source system owners say about the data against the actual data; it should also verify what the data shows against what the source system owners say. This process is iterative. For example, a source system owner may say that all customer names must have a full first name and full surname; however, when this rule is checked against the data, it shows that 10% of the records have only a first initial. In this case, the finding must be discussed with the source system owner. This type of anomaly may be explained by a business rule that was applied to new data but not to historical data. Further analysis is then performed to verify that all anomalous records were created before the expected date. A minimal sketch of this kind of rule check is shown below.
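
The sketch below (hypothetical data, names, and rule-introduction date, using pandas) checks the "full first name" rule against an extract, and then tests whether all anomalous records predate the assumed date the rule was introduced.

    import pandas as pd

    # Hypothetical extract: the SME's rule says first_name must be a full name.
    records = pd.DataFrame({
        "first_name": ["John", "Mary", "J.", "K.", "Anne"],
        "created": pd.to_datetime(["2009-03-01", "2009-05-12", "2001-07-20",
                                   "2002-01-03", "2008-11-30"]),
    })

    # Flag records that violate the rule (a single initial, e.g. "J.").
    is_initial = records["first_name"].str.fullmatch(r"[A-Za-z]\.?")
    print(f"{is_initial.mean():.0%} of records have only a first initial")

    # Cross-check: were all anomalous records created before the date the
    # rule was introduced (assumed here to be 2005-01-01)?
    rule_date = pd.Timestamp("2005-01-01")
    print("All anomalies predate the rule:",
          bool((records.loc[is_initial, "created"] < rule_date).all()))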

Steps in the Process

This section provides a process-driven approach for executing Data Profiling. It should be noted that this process is often iterative and may start with a preliminary profiling of key systems before doing end-to-end profiling.

Step 3: Multi-Table Profiling

Objective: This task involves analysing data across tables to look for redundancy and referential integrity issues. The task may be done on a per-system basis, a sub-system basis, or across systems.

Input: Table Profiling

Process:
  • Using the profiling tool, perform relationship analysis on the selected tables to confirm uniqueness and referential integrity and to identify redundant data across tables
  • Update relevant sections of the Data Quality Assessment Report with findings
  • Discuss results with the Source System SME
  • Plan further analysis based on results

Output: Redundancy Analysis will identify:
  • Potential relationships with fields in other tables
  • Redundant data between tables
  • Potential referential integrity issues, e.g. identification of orphaned records

Deliverables include:
  • Completion of the relevant sections of the XXX Data Quality Assessment.doc

Examples
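
As a starting point for this section, here is a minimal Python sketch (hypothetical table and column names) of how the first Redundancy Analysis output above, potential relationships with fields in other tables, might be identified: test whether one column's values are wholly contained in a candidate key column of another table.

    import pandas as pd

    # Hypothetical tables: does orders.cust point at customers.customer_id?
    orders = pd.DataFrame({"order_id": [1, 2, 3], "cust": [7, 8, 7]})
    customers = pd.DataFrame({"customer_id": [7, 8, 9]})

    # Inclusion test: if every populated child value also appears in the
    # candidate key column, the pair is a candidate relationship worth
    # confirming with the Source System SME.
    child = set(orders["cust"].dropna())
    parent = set(customers["customer_id"].dropna())
    if child <= parent:
        print("Candidate relationship: orders.cust -> customers.customer_id")
    else:
        print("Orphaned values:", child - parent)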
