Data Investigation Detailed Process Steps

From MIKE2.0 Methodology


Data Investigation and Re-Engineering Process Steps describes the steps for investigating and addressing data quality issues at a lower level of detail than the Overall Implementation Guide.

Objectives of Data Profiling

  • Identify data quality issues – Measurements are taken against a number of dimensions to help identify issues at the individual attribute level, at the table level, and between tables.
  • Capture metadata as a by-product of the process – Metadata is key to the success of subsequent phases of the data load, including the ongoing use and maintenance of the data. Useful metadata is identified as part of the data profiling exercise, and the tool-based approach will ensure it is captured in a reusable format.
  • Identify business rules – Data profiling assists in gaining an understanding of the data held in the system and in identifying business rules for handling the data. This feeds into the data mapping exercise that follows.
  • Assess the ‘fitness for purpose’ of the source system data to satisfy the business requirements – This assessment is often done in the context of a Data Warehouse, CRM, or ERP initiative. The focus is therefore on gaining a very detailed understanding of the source data that will feed these target systems, to ensure that the quality level is sufficient to meet the requirements of the target system.

This section is meant to provide guidelines as to the tests that are commonly used and the approach taken to measure data quality. Each project should consider these a starting point only and should ensure that the specific tests that will uncover the greatest value for the client are identified as part of the discovery process for that client.

Key Deliverables of Data Profiling

  • Data Quality Report (per Source Systems)
  • Definition of Mapping Rules and Business Rules

Different Types of Profiling

Data sources are profiled in multiple dimensions: down columns (Column Profiling); across rows (Table Profiling); and across tables (Multi-Table Profiling). The output from Data Profiling can then be used in a number of fashions, including data cleansing and mapping to a target environment.

Column Profiling

Column profiling typically examines the data found in a single column/field in either a table or a flat file. This analysis can (1) examine the actual values found in the column (e.g. produce a complete frequency report listing each and every value found), (2) look for the various “data patterns” of the values in the column, or (3) discover the underlying “parsed patterns” that might exist within more complex, free-form columns. Assessments are typically conducted to verify validity/uniqueness, accuracy, and completeness; column profiling can help answer the following questions:

  • What do the fields mean?
  • Which ones are of interest to the project?
  • What is the quality of data in each field?
  • And, given this, is the data of sufficient quality to load into the target information environment?

Analysis of complex fields such as names, addresses, and contact information can be done to determine the patterns and consistency of the data. This shows whether values follow a standard format and whether that format is consistent with similar fields on other tables within the database. From a vendor perspective, separate tools may be used to profile simple fields (ones that do not require parsing) as opposed to complex fields.
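As an illustration only, a column profile of this kind can be sketched in Python with the pandas library: completeness, a frequency report, and a simple pattern analysis over a single column. The file and column names are hypothetical; in practice a vendor profiling tool would produce the equivalent output.

```python
import pandas as pd

# Hypothetical extract of a source table (assumed file and column names).
df = pd.read_csv("client_accounts_extract.csv", dtype=str)
column = "account_number"

# Completeness: how many values are missing or blank.
total = len(df)
missing = df[column].isna().sum() + (df[column].str.strip() == "").sum()

# Frequency report: every distinct value and how often it occurs.
frequency = df[column].value_counts(dropna=False)

# Data patterns: replace digits with '9' and letters with 'A' to reveal formats.
patterns = (
    df[column]
    .fillna("")
    .str.replace(r"\d", "9", regex=True)
    .str.replace(r"[A-Za-z]", "A", regex=True)
    .value_counts()
)

print(f"{missing} of {total} records are blank or null")
print("Most frequent values:\n", frequency.head(10))
print("Observed patterns:\n", patterns.head(10))
```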

Table Profiling

Table Analysis is used to determine key information as well as relationship and dependency patterns between the fields within a table. It is done across every field within each table in the source system. Table Analysis is important because there are some situations where it is absolutely essential to examine values in two or more columns. A blank value in Column A might be acceptable if there is a valid value in Column B, but unacceptable if Column B is also blank. Similarly, a valid value in Column A (e.g., product code “HL”) and a valid value in Column B (e.g., sub-product code “521”) might be an invalid code combination. In such situations you must look at value combinations across columns rather than just examine values within a single column. A tools-based approach is undoubtedly the best choice for examining these cross-column or “table based” dependencies.
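A minimal sketch of these cross-column checks is shown below. The file and column names are hypothetical, and the reference set of valid code combinations is invented for illustration; a profiling tool would normally derive and manage such rules.

```python
import pandas as pd

# Hypothetical extract with the columns discussed above.
df = pd.read_csv("product_extract.csv", dtype=str)

# Rule 1: a blank Column A is acceptable only when Column B holds a value.
both_blank = df[df["column_a"].isna() & df["column_b"].isna()]

# Rule 2: product code and sub-product code must form a known combination.
valid_combinations = {("HL", "520"), ("HL", "530"), ("PL", "521")}  # assumed reference set
observed = set(zip(df["product_code"], df["sub_product_code"]))
invalid_combinations = observed - valid_combinations

print(f"{len(both_blank)} records have both Column A and Column B blank")
print(f"{len(invalid_combinations)} code combinations are not in the reference set")
```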

Multi-Table Profiling

Multi-Table Analysis looks for relationships among columns existing in different tables, rather than just within a single table. Multi-Table Profiling shows the relationships between tables using key fields as links to bridge the multiple tables. The aim is to analyse and determine the existence of referential integrity issues between the tables analysed, e.g. orphans and clients without address and contact information. If the profiling analysis reveals that two columns in different tables share the same set of values and the columns are not a primary key / foreign key pair, then redundant data exists between the two tables. For example, an employee table and an employee demographics table might both contain a “home phone number” column with the same values; an employee’s home phone number is then stored redundantly in two separate tables.
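The orphan checks described above might be sketched as follows (file and column names are assumptions; a profiling tool would report these relationships automatically):

```python
import pandas as pd

# Hypothetical extracts of two related source tables.
customers = pd.read_csv("customer_extract.csv", dtype=str)
loans = pd.read_csv("customer_loan_extract.csv", dtype=str)

# Orphans: loan records whose Customer ID has no match in the customer table.
orphans = loans[~loans["customer_id"].isin(customers["customer_id"])]

# Customers with no loan record (may be legitimate, but worth reporting).
without_loans = customers[~customers["customer_id"].isin(loans["customer_id"])]

print(f"{len(orphans)} orphaned loan records (referential integrity issue)")
print(f"{len(without_loans)} customers have no loan record")
```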

Data Quality Areas for Assessment

The reference information gathered in the previous step provides a wide variety of input to the Tools-Based Assessment. This step is focused on designing the actual tests that will be performed. The previous step includes gaining direct business and technical feedback on data quality issues that are known to exist, including examining bug reports and change requests. This information provides significant insight into the specific tests that should be performed. This section provides an overview of the types of tests that should be considered for any Tools-Based Assessment.

The assessment areas below cover the primary classifications of quality issues that are typically encountered in analysing an organisation’s information assets. They contain only the expected data quality issues based on previous assessment experience, and indicate whether each type of assessment is typically included in the Data Quality Report described in this document. The need for specific test types may vary; for example, an organisation may want to verify the accuracy of data by contacting customers directly or through another source. These kinds of tests can certainly be conducted, but be sure that the bid has been scoped appropriately to include them.

Accuracy/Correctness

Description
The degree of agreement between a data value (or set of values) and a source assumed to be correct. The source may be a reference set obtained by comparison to “real world” data, or by reference to a data set on another system or file that is deemed “correct”.
Typical issues uncovered during assessment
It is expected that some names and addresses will be spelt incorrectly or will not be an accurate representation of the real party or address. Other free-form text, including customer description information, may also not correctly represent the party.
Effectiveness of profiling to assess this area
Accuracy is very difficult to assess without verifying customer details directly with the customer, or using a “correct” 3rd party data source. Where a reference data set is available, e.g. a list of postcodes or addresses, this may be assessed; otherwise the source system data is assumed to be accurate.
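Where such a reference data set is available, a simple membership check is often enough to flag suspect values. The sketch below assumes a hypothetical postcode reference file and address extract:

```python
import pandas as pd

# Hypothetical third-party reference file, assumed to be correct.
reference_postcodes = set(pd.read_csv("postcode_reference.csv", dtype=str)["postcode"])

addresses = pd.read_csv("customer_address_extract.csv", dtype=str)

# Any postcode not in the reference set is flagged as potentially inaccurate.
suspect = addresses[~addresses["postcode"].isin(reference_postcodes)]
print(f"{len(suspect)} of {len(addresses)} addresses have a postcode not in the reference set")
```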

Completeness

Description
The degree to which values are present in the attributes that require them, and the degree to which the attributes cover the user data requirements.
Typical issues uncovered during assessment
Free-text fields not completed (e.g. name, address, phone numbers, types, descriptions; null values). Target mandatory fields not populated, or not present, in the source system extract file. The source systems may not have all the attributes required to load the target systems.
Effectiveness of profiling to assess this area
The completeness of the field values is tested as part of the specific TBA tests. The completeness of the attributes to satisfy the user requirements is assessed in the source-to-target mapping documents, which highlight where a target attribute cannot be populated because a source field is not available.

Timeliness / Currency

Description
Currency measures how up-to-date the data is, and whether the data required can be provided by the required time.
Typical issues uncovered during assessment
The source extracts may not be provided by the required time. Product and type codes that change over time may mean that products or types assigned before a particular cut-over date and those assigned after it represent slightly different products, or carry a different type, under the same code.
Effectiveness of profiling to assess this area
The source system product codes are expected to be unchanged over the product life. If this assumption does not hold, the change in product code is handled during the transformation of source data in the target data acquisition.

Changes to customer type, product types, and other codes will be treated the same way as product codes. Delays in providing the source system extracts by the specified time are identified as part of the general project reporting.

Consistency / Uniqueness (No Duplicates, Integrity)

Description

Consistency is the extent to which there is a single representation of data. It also covers the extent to which data is duplicated within a system (e.g. customer duplicates) and the ability to establish the uniqueness of a data record (and data key values).

Typical issues uncovered during assessment

Duplicate customer records

Effectiveness of profiling to assess this area

Determining whether there are customer duplicates in large-volume data sets requires specialist tools. De-duplication also requires a dedicated data quality improvement project that includes appropriate governance and sponsorship as well as appropriate tools to establish and implement the de-duplication.
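A specialist matching tool remains the right answer for de-duplication, but a crude first-pass profile can count exact matches on a normalised key, as in the sketch below (the column names and the match key are assumptions):

```python
import pandas as pd

# Hypothetical customer extract.
customers = pd.read_csv("customer_extract.csv", dtype=str)

# Build a simple match key from name and date of birth (assumed columns).
key = (
    customers["last_name"].fillna("").str.upper().str.strip()
    + "|" + customers["first_name"].fillna("").str.upper().str.strip()
    + "|" + customers["date_of_birth"].fillna("")
)

# Any key occurring more than once is a candidate duplicate group.
counts = key.value_counts()
candidate_groups = counts[counts > 1]
print(f"{len(candidate_groups)} candidate duplicate groups found on the simple match key")
```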

Validity

Description

The data is stored in an acceptable format and is within a reasonable range of possible values.

Typical issues uncovered during assessment

Target-enforced formats, lengths, and data types not implemented in the source systems (e.g. a YYYY-MM-DD date is stored as a character field in most host systems, and if these fields contain other characters they cannot be transformed into the target date format). Target codes and types may not map one-to-one to the source system codes and types. Values for some attributes may not be within an acceptable range.

Effectiveness of profiling to assess this area

It is generally not practical to check that all the values in every attribute loaded are of the correct type, length, and format. The most important data types to assess are date and numeric fields. Typical checks include the following (a minimal sketch is given after this list):

  • Check, and where possible convert, source system dates to the standard YYYY-MM-DD target format.
  • Check the format of identification numbers.
  • Check the range of important balance and rate fields. These are atomic value checks only; they do not net accounts and then check the range.
  • Compound attribute tests: check the value of one attribute that is dependent on another attribute.
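A minimal sketch of these validity checks follows; the field names, the nine-digit identification rule, and the balance range are assumptions for illustration:

```python
import pandas as pd

# Hypothetical account extract.
df = pd.read_csv("account_extract.csv", dtype=str)

# Date check: values that cannot be converted to the YYYY-MM-DD target format.
dates = pd.to_datetime(df["open_date"], format="%Y-%m-%d", errors="coerce")
bad_dates = df[dates.isna()]

# Identification number format: assumed rule of exactly nine digits.
bad_ids = df[~df["id_number"].fillna("").str.fullmatch(r"\d{9}")]

# Range check on an important balance field (atomic values only, no netting).
balance = pd.to_numeric(df["balance"], errors="coerce")
out_of_range = df[(balance < -1_000_000) | (balance > 1_000_000)]

# Compound attribute test: a closed account must carry a close date.
missing_close_date = df[(df["status"] == "CLOSED") & df["close_date"].isna()]

print(len(bad_dates), len(bad_ids), len(out_of_range), len(missing_close_date))
```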

Accessibility

Description

Ability for users to extract the existing data they require. Users must not have different interpretations of the same data.

Typical issues uncovered during assessment

The meaning of each file attribute, and the interrelationships between attributes and files, can be difficult to determine. This requires specific product system and business knowledge.

This may cause errors in the mapping of source fields to target fields.

The layout and extract process documentation may not match the actual source system files delivered.

Effectiveness of profiling to assess this area

This is a manual check of source system data and metadata availability.

If the metadata does not match the actual file, an issue is raised.

The relevant subject matter experts will be consulted to confirm or explain the definition of ambiguous or conflicting attributes.

Typical Investigations Performed

Column Profiling (Simple Fields)

Column Profiling of Simple Fields is most typically done to check for Completeness, Uniqueness and Validity. An example would be the Account Number field within the Client_Accounts table. The Account Number column is analysed for:

Column Profiling (Simple Fields)
Field: Account Number
  • Completeness – All records contain a non-zero value.
  • Uniqueness – All records contain a different account number.
  • Validity – All account numbers are numeric or follow the source system’s account number structure.
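These Account Number checks might be sketched as follows; the file name, column name, and the eight-digit structure rule are assumptions:

```python
import pandas as pd

# Hypothetical client accounts extract.
accounts = pd.read_csv("client_accounts_extract.csv", dtype=str)
acct = accounts["account_number"]

completeness = acct.notna() & (acct.str.strip() != "") & (acct != "0")
uniqueness = ~acct.duplicated(keep=False)
validity = acct.fillna("").str.fullmatch(r"\d{8}")  # assumed account number structure

print(f"Completeness failures: {(~completeness).sum()}")
print(f"Duplicate account numbers: {(~uniqueness).sum()}")
print(f"Validity failures: {(~validity).sum()}")
```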

Column Profiling (Complex Fields)

Column Profiling of Complex Fields is most typically conducted to verify the validity/uniqueness, format accuracy and completeness. An example would be looking at Customer Records:

Column Profiling (Complex Fields)
Field: Multiple name fields (as per Client Name)
  • Completeness – All the name fields should contain a title, first name, and last name. Not all person names contain middle names or suffixes such as Jnr.
  • Uniqueness – Presence of a title at the start of the name; a first name and a last name are present.
  • Validity – No invalid characters (e.g. #, @, %) present in any of the name fields.
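The name-field checks above might be sketched as follows; the column name, the list of titles, and the invalid-character set are assumptions:

```python
import pandas as pd

# Hypothetical customer extract with a single free-form name column.
customers = pd.read_csv("customer_extract.csv", dtype=str)
names = customers["full_name"].fillna("")

# Completeness/Uniqueness: title at the start, plus at least two name words.
has_title = names.str.upper().str.match(r"(MR|MRS|MS|DR)\b")
has_first_and_last = names.str.split().str.len() >= 2

# Validity: characters such as #, @ and % should not appear in name fields.
invalid_chars = names.str.contains(r"[#@%]", regex=True)

print(f"{(~has_title).sum()} names without a leading title")
print(f"{(~has_first_and_last).sum()} names missing a first or last name")
print(f"{invalid_chars.sum()} names containing invalid characters")
```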

Table Profiling

Table Profiling is most typically conducted to verify the validity/uniqueness, format accuracy and completeness. An example would be looking at Customer Records:

Table Profiling
Field: Customer ID
  • Completeness – Is a unique value populated within this field to uniquely identify each customer record? If so, is there a high population ratio on this field (i.e. 100% populated)?
  • Uniqueness – Presence of a primary key or foreign key in relational databases. Are these keys unique to each record?
  • Validity – Are these key fields populated with valid (i.e. unique) values?
  • Format – Are these values in a consistent format (e.g. numeric), or do they contain leading alphas?

Multi-Table Profiling

Multi-Table Profiling is conducted to verify the validity/uniqueness and accuracy of the key fields to ensure referential integrity. An example would be to assess the Customer Table vs. the Customer Loan Table.

Multi-Table Profiling
Customer Table:
Field: Customer ID
  • Completeness – Is a unique value populated within this field to uniquely identify each customer record? If so, is there a high population ratio on this field (i.e. 100% populated)?
  • Uniqueness – Presence of a primary key or foreign key in relational databases. Are these keys unique to each record?
  • Validity – Are these key fields populated with valid (i.e. unique) values?
  • Format – Are these values in a consistent format (e.g. numeric), or do they contain leading alphas?

Customer Loan Table:
Field: Customer ID
  • Completeness – Is a value populated within this field that identifies the related customer record within the Customer table? If so, is there a high population ratio on this field (i.e. 100% populated)?
  • Uniqueness – Presence of a foreign key back to the Customer table in relational databases.
  • Validity – Are these key fields populated with valid values, i.e. values that exist in the Customer table?
  • Format – Are these values in a consistent format (e.g. numeric), or do they contain leading alphas?

Some vendors have multiple profiling products that are used to fulfil these functions.

The Iterative Profiling Process

Data Profiling is an Iterative Process

It is important to note that there is an iterative approach to profiling within each of the analysis steps below. Each step involves:

  • Running the analysis within the appropriate vendor tool
  • Analysing the results of each analysis
  • Verifying the results with the source system SME
  • Documenting the results (both in deliverables and within the profiling tools)
  • Planning further analysis based on the results

The data investigation process should verify what the source system owners say about the data against the actual data; it should also verify what the data says against what the source system owners say.

This process is iterative – for example, a source system owner may say that all customer names must have a full first name and full surname. However, when this rule is checked against the data, it may show that 10% of the records have only a first initial. In this case, the finding must be discussed with the source system owner. This type of anomaly may be explained by a business rule that was applied to new data but not to historical data. Further analysis is then performed to verify that all anomalous records were created before the expected date. Data Re-Engineering also follows an iterative process to standardize, correct, match, and enrich data.
