From MIKE2 Methodology
| This deliverable template is used to describe a sample of the MIKE2.0 Methodology (typically at a task level). More templates are now being added to MIKE2.0 as this has been a frequently requested aspect of the methodology. Contributors are strongly encouraged to assist in this effort.
|
| Deliverable templates are illustrative as opposed to fully representative. Please help add examples to this template that are representative of the proposed output.
|
Overview
The Perform Column Profiling task involves profiling the data found in a single column/field in either a table or a flat file. It involves analysis of simple and complex fields. Each task is done on a per-system (or subset of a system) basis.
Key steps to column profiling include:
- Understand all the fields and document their descriptions in the profiling tool
- Now that fields are understood, analyse the fields that have been identified as being of interest during the interviewing-based assessment
- Update relevant sections of the Data Quality Assessment Report with findings
- Determine additional tables which may be required to provide the desired data
Simple and Complex field profiling may be split into separate tasks.
Example 1 - Column Profiling
Listed below is a simple report of column profiling results. The results would likely be directly in a tool, this example shows the summarised output.
Summary of Data Quality Findings
This section presents the results of the investigations performed. With the stated aim of the Data Investigation being to determine the "fit for purpose" of the data held in the ABC Client and Contract tables for a XYZ solution, the overall results can be generally described as below average to average.
Many key fields are not populated, data has not been normalised and extensive [[Data Re-Engineering Component | re-engineering of the data is required to make it usable.
In particular key data elements are missing in most of the feeds (D.O.B. and Sex) which severely depletes the data's "fit for purpose" for the Client Match de-dupe process.
The column analysis was performed across all the columns in the table. However special attention should be given to the columns that allow a de-duplication process to be performed. This allows an accurate master client table to be created.
Of the specific 3rd Party provides the UGC data feed is particularly poor. It represents 67% if the total number of records but falls well short of what is required from a quality perspective.
Profiling Results
UGC Insurance
| Column
| Column Name
| % Usable values
| Number of Valid Records
| Total Number
|
| Date of Birth
| CL_DATE_OF_BIRTH
| 0%
| 0
| 371044
|
| Surname
| CL_SURNAME
| 1.99%
| 7402
| 371044
|
| First Name
| CL_FIRSTNAME
| 1.99%
| 7402
| 371044
|
| Name
| CL_NAME
| 100%
| 371044
| 371044
|
| Post Code
| CL_POSTCODE
| 100%
| 371044
| 371044
|
| Suburb
| CL_SUBURB
| 0%
| 0
| 371044
|
| Address
| CL_ADDRESS
| 100%
| 371044
| 371044
|
| Sex
| CL_SEX
| 0%
| 0
| 371044
|
| Addressee
| CL_ADDRESSEE
| 100%
| 371044
| 371044
|
| State
| CL_STATE
| 0%
| 0
| 371044
|
| Email
| CL_EMAIL_ADDRESS
| 0%
| 0
| 371044
|
| Home Phone
| CL_HOME_PHONE
| 0%
| 0
| 371044
|
| Work Phone
| CL_WORK_PHONE
| 0%
| 0
| 371044
|
Fit-for-purpose analyis : Client Match / De-Dupe
The data quality is sufficient to perform human based searches. The absence of D.O.B, sex, suburb and state make a confident match impossible. UGC Insurance also only provides initials, instead of first name. This dramatically weakens the overall use of the first name field, across all the data sources, even if those other sources have a first name field.
For example, to search by the first name "James". All UGC Insurance records can use "J" as a search value.
Fit-for-purpose analyis : XYZ Solution
From the above data we can see that the feed contains 100% CL_NAME. After an analysis of the CL_NAME data it is possible to populate CL_SURNAME from CL_NAME, however CL_FIRSTNAME cannot be obtained as only the initials are provided by UGC.
Given that CL_POSTCODE contains 100% of the values, it's also possible to use an iterative process to populate CL_SUBURB. A similar process could also be used to populate CL_STATE. Populating the Suburb and Surname fields will allow for quicker searches and result in a reduction in the number of false positives.
The lack of contact information (Email, Home and Work phone numbers) limits the current data to be only useful in performing inbound related services. No outbound activities will be possible based on the current data from UGC Insurance.
Rating: 2/5
Samples Search:
| Criteria
| Results
| Time
|
| CL_NAME like ’%JONES%'
CL_ADDRESS like ’%TWEED HEADS%'
| 10 records 3 Distinct Client Matches
| 8 sec
|
| CL_NAME like ’ %WILSON%'
CL_ADDRESS like
’%GLADSTONE%’
| 12 Results 2 Distinct Client Matches
| 6.5 sec
|
'Sample Results'*:
| CL_NAME
| CL_ADDRESS
|
| JONES J P
| C/ TWEED HEADS HOTEL TWEED HEADS
|
| JONES J P
| C/ TWEED HEADS HOTEL TWEED HEADS
|
| JONES J
| C/- TWEED HEADS HOTEL TWEED HEADS
|
| JONES J
| C/- TWEED HEADS HOTEL TWEED HEADS
|
* Note: Not actual customer customer results.
The above table represents an extract of four of the ten sample search results. The data set clearly shows some of the problems that are faced in performing the client de-duping process. It's possible that they all represent the same client, however the address and initials are slightly different (Note: punctuation like this can be filtered out to improve search results).
However other inconsistencies like Unit 22 238 Smith St and 22/238 Smith Street represent a larger problem that will require extensive data re-engineering and no simple solution.
Other inconsistencies in the UGC data are the presence of initials and no first name. In these two cases the initials have been entered following a different convention. Again data inconsistencies like this makes performing basic data matches even more complicated and lowers the overall data quality.