Personal tools

Partners

Data Re-Engineering Component

From MIKE2 Methodology

Jump to: navigation, search

Data Re-Engineering is a multi-step process to improve and enrich information quality that includes the set of component capabilities listed below.

Contents

Data Standardisation (Data Conditioning)

Data Standardisation

Data Standardisation refers to the conditioning of input data to ensure that the data has the same type of content and format. Standardised data is important for effectively matching data, and facilitating a consistent format for output data.

To enable standardisation, string-based data is typically parsed into tokens prior to matching. Based on analysis of the data and its conformance to pre-specified token sequence (called patterns), the data itself enables this tokenisation.

This process identifies and corrects invalid values, standardises spelling formats and abbreviations, and validates the format and content of the data. Vendor standarisation tools can typically standardise both fixed format fields (e.g. dates, ABNs) and free-form fields (e.g. address data or name data). The following is an example of how standardisation could tokenise a record for a business.

Standardisation of data can be used to redefine data attributes. For example, business structure words such as “PTY LTD” and “FAMILY TRUST” can be removed from the Name fields and inserted in a separate field; or ABNs can be removed from business names into a separate field. Standardisation can also be used to standardise commonly abbreviated words. For example, building names “CENTRE”, “CNTR”, “CENT”, CEN” can be extracted from the Address data and abbreviated as “CTR” or otherwise as required by the business rules.

Data Correction

Data Correction

Data Correction involves fixing data issues associated with gaps in information, value problems, problems related to data freshness or state of information or data that needs to be corrected due to range issues. Correction can range from being a heavily manual process to that which involves a tool or set of reference data.

Data Correction is often complex as the information has already been “lost” (e.g. we are no longer in contact with the customer; the data was never stored at the proper level of granularity). Correction can also be difficult as data that presents quality issues in one system may not necessarily have the same impact in another system, depending on how the data is used.

Correction (and other types of data re-engineering) therefore typically follow the “80/20 rule” using a repetitive software development lifecycle until data has been corrected to the level that provides the most business value.

Data Matching

Data Matching

Data Matching is performed either to consolidate records (de-duplication) or to link records to form new associations. Matching is either exact or probabilistic.

The benefit of exact matching is that it provides the highest degree of confidence that the records that have matched are for the exact entity. This comes at a trade-off of high technical programming costs and missed opportunities to match many records.

The major benefit of probabilistic matching over and above exact matching is that it employs fuzzy logic to match fields that are similar, and hence it:

  • Allows matching of records that have transposition or spelling errors and therefore obtain a significant increase in matches over systems using purely string comparisons.
  • Standardises data in free-form fields and across disparate data sources.
  • Uncovers information buried in free-form fields and identifying relationships between data values.
  • Provides survivorship of the best data within and between source systems.
  • Does not require extensive programming to develop matches based on simple business requirements.

Multiple types of matching capabilities may be employed:

  • One-to-One. This matching process identifies all records on one file that correspond to a record for the same individual, event, household, street address, etc., on the second file. Only one record on File B can match a single record on File A.
  • One-To-Many Matching. For one-to-many matching processes, a single record on File A can match to multiple records on File B.
  • De-Duplication. Matches records in a single file that apply to the same individual, household, event, etc. For these procedures, only File A is used. Matches found can be in groups, rather than pairs. Groups are identified by a group ID and can be automatically accepted as a match, or manually reviewed if required

To acheive matching results, individual fields within these record subsets are compared using probabilistic analysis, resulting in an aggregate weight. A threshold is set above which a match is reached. A match is defined when, within the specified parameters, sufficient attributes agree to generate a score above the threshold. Attributes that disagree will reduce the aggregate weight. Weightings for each attribute are derived based on a frequency analysis of the data population fed into the match.

The probabilistic matching process compares fields specified as matching fields to determine the best match for a record using comparison algorithms. Some that are commonly used include:

  • CHAR Comparison: A character-by-character comparison. If one field is shorter than the other, the shorter field is padded with trailing blanks to match the length of the longer field. Any mismatched character causes the disagreement weight to be assigned.
  • NAME_UNCERT Comparison: Compares first names, where one name might be truncated. This comparison uses the shorter length of the two names for the comparison and does not compare any characters after that length.
  • NUMERIC Comparison: An algebraic numeric comparison. Leading spaces are converted to zeros and the numbers are compared.
  • UNCERT Comparison: A character comparison that uses an information-theoretic character comparison algorithm to compare two character strings.

This comparison provides for phonetic errors, transpositions, random insertion, deletion, and replacement of characters within strings.

To determine whether a record is a match, the matching tool calculates a weight for each comparison according to the probability associated with each field and sums the weights assigned to each field comparison and obtains a composite weight; the higher the composite weight, the greater the chance of a match. The match comparisons have parameters that can be configured according to the business rules.

Records will be matched if their match weights exceed the target set or cut-off for the type of match. The higher cut-off indicates that there is a higher degree of confidence or greater certainty that customers are matched. Match weights can vary depending on the size of the files being matched if the tool calculates frequency distributions on match fields. The typical output from a match report contains a frequency and weight analysis for the match, including the matches, duplicates and non-matches for each pass.

Data Enrichment

Data Enrichment

Data Enrichment involves adding additional information to data records to fill gaps in the core set of data. The most commonly added enrichment data involves location information such as geocodes, delivery identifiers, customer contact information, personal information such as date-of-birth or gender codes, demographic information or external events.

Enrichment data can be obtained from the organization’s own data or from third-party sources. Loading up of enrichment data follows a typical ETL process and is subject to data quality issues.

Powered by omCollab