Personal tools

Partners

Perform Column Profiling Deliverable Template

From MIKE2 Methodology

Jump to: navigation, search
This deliverable template is used to describe a sample of the MIKE2.0 Methodology (typically at a task level). More templates are now being added to MIKE2.0 as this has been a frequently requested aspect of the methodology. Contributors are strongly encouraged to assist in this effort.
Deliverable templates are illustrative as opposed to fully representative. Please help add examples to this template that are representative of the proposed output.

Contents

Overview

The Perform Column Profiling task involves profiling the data found in a single column/field in either a table or a flat file. It involves analysis of simple and complex fields. Each task is done on a per-system (or subset of a system) basis.

Key steps to column profiling include:

  • Understand all the fields and document their descriptions in the profiling tool
  • Now that fields are understood, analyse the fields that have been identified as being of interest during the interviewing-based assessment
  • Update relevant sections of the Data Quality Assessment Report with findings
  • Determine additional tables which may be required to provide the desired data

Simple and Complex field profiling may be split into separate tasks.

Example 1 - Column Profiling

Listed below is a simple report of column profiling results. The results would likely be directly in a tool, this example shows the summarised output.

Summary of Data Quality Findings

This section presents the results of the investigations performed. With the stated aim of the Data Investigation being to determine the "fit for purpose" of the data held in the ABC Client and Contract tables for a XYZ solution, the overall results can be generally described as below average to average.

Many key fields are not populated, data has not been normalised and extensive [[Data Re-Engineering Component | re-engineering of the data is required to make it usable.

In particular key data elements are missing in most of the feeds (D.O.B. and Sex) which severely depletes the data's "fit for purpose" for the Client Match de-dupe process.

The column analysis was performed across all the columns in the table. However special attention should be given to the columns that allow a de-duplication process to be performed. This allows an accurate master client table to be created.

Of the specific 3rd Party provides the UGC data feed is particularly poor. It represents 67% if the total number of records but falls well short of what is required from a quality perspective.

Profiling Results

UGC Insurance

Column Column Name % Usable values Number of Valid Records Total Number
Date of Birth CL_DATE_OF_BIRTH 0% 0 371044
Surname CL_SURNAME 1.99% 7402 371044
First Name CL_FIRSTNAME 1.99% 7402 371044
Name CL_NAME 100% 371044 371044
Post Code CL_POSTCODE 100% 371044 371044
Suburb CL_SUBURB 0% 0 371044
Address CL_ADDRESS 100% 371044 371044
Sex CL_SEX 0% 0 371044
Addressee CL_ADDRESSEE 100% 371044 371044
State CL_STATE 0% 0 371044
Email CL_EMAIL_ADDRESS 0% 0 371044
Home Phone CL_HOME_PHONE 0% 0 371044
Work Phone CL_WORK_PHONE 0% 0 371044

Fit-for-purpose analyis : Client Match / De-Dupe

The data quality is sufficient to perform human based searches. The absence of D.O.B, sex, suburb and state make a confident match impossible. UGC Insurance also only provides initials, instead of first name. This dramatically weakens the overall use of the first name field, across all the data sources, even if those other sources have a first name field.

For example, to search by the first name "James". All UGC Insurance records can use "J" as a search value.

Fit-for-purpose analyis : XYZ Solution

From the above data we can see that the feed contains 100% CL_NAME. After an analysis of the CL_NAME data it is possible to populate CL_SURNAME from CL_NAME, however CL_FIRSTNAME cannot be obtained as only the initials are provided by UGC.

Given that CL_POSTCODE contains 100% of the values, it's also possible to use an iterative process to populate CL_SUBURB. A similar process could also be used to populate CL_STATE. Populating the Suburb and Surname fields will allow for quicker searches and result in a reduction in the number of false positives.

The lack of contact information (Email, Home and Work phone numbers) limits the current data to be only useful in performing inbound related services. No outbound activities will be possible based on the current data from UGC Insurance.

Rating: 2/5

Samples Search:

Criteria Results Time
CL_NAME like ’%JONES%'


CL_ADDRESS like ’%TWEED HEADS%'

10 records
3 Distinct Client Matches
8 sec
CL_NAME like ’ %WILSON%'


CL_ADDRESS like ’%GLADSTONE%’

12 Results
2 Distinct Client Matches
6.5 sec


'Sample Results'*:

CL_NAME CL_ADDRESS
JONES J P C/ TWEED HEADS HOTEL TWEED HEADS
JONES J P C/ TWEED HEADS HOTEL TWEED HEADS
JONES J C/- TWEED HEADS HOTEL TWEED HEADS
JONES J C/- TWEED HEADS HOTEL TWEED HEADS

* Note: Not actual customer customer results.

The above table represents an extract of four of the ten sample search results. The data set clearly shows some of the problems that are faced in performing the client de-duping process. It's possible that they all represent the same client, however the address and initials are slightly different (Note: punctuation like this can be filtered out to improve search results).

However other inconsistencies like Unit 22 238 Smith St and 22/238 Smith Street represent a larger problem that will require extensive data re-engineering and no simple solution.

Other inconsistencies in the UGC data are the presence of initials and no first name. In these two cases the initials have been entered following a different convention. Again data inconsistencies like this makes performing basic data matches even more complicated and lowers the overall data quality.

Powered by omCollab