Data Re-Engineering Component
From MIKE2.0 Methodology

Data Re-Engineering is a multi-step process to improve and enrich information quality. It comprises the component capabilities described below.

Data Standardisation (Data Conditioning)

Data Standardisation refers to the conditioning of input data to ensure that the data has the same type of content and format. Standardised data is important for matching data effectively and for producing output data in a consistent format. To enable standardisation, string-based data is typically parsed into tokens prior to matching. Tokenisation is driven by the data itself: each value is analysed for its conformance to pre-specified token sequences (called patterns). This process identifies and corrects invalid values, standardises spelling formats and abbreviations, and validates the format and content of the data. Vendor standardisation tools can typically standardise both fixed-format fields (e.g. dates, ABNs) and free-form fields (e.g. address data or name data).

The following is an example of how standardisation could tokenise a record for a business.

Standardisation of data can be used to redefine data attributes. For example, business structure words such as “PTY LTD” and “FAMILY TRUST” can be removed from the Name fields and inserted in a separate field, or ABNs can be moved out of business names into a separate field. Standardisation can also be used to standardise commonly abbreviated words. For example, the building-name variants “CENTRE”, “CNTR”, “CENT” and “CEN” can be extracted from the Address data and abbreviated as “CTR”, or otherwise as required by the business rules.

Data Correction

Data Correction involves fixing data issues associated with gaps in information, value problems, problems related to data freshness or the state of information, or data that needs to be corrected due to range issues. Correction can range from a heavily manual process to one that relies on a tool or a set of reference data. Data Correction is often complex because the information has already been “lost” (e.g. we are no longer in contact with the customer, or the data was never stored at the proper level of granularity).
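
As a rough illustration of correction driven by reference data, the following Python sketch fills a missing or unrecognised state code from a postcode lookup and flags values that cannot be repaired automatically for manual review. The `POSTCODE_TO_STATE` table, the field names and the 0 to 120 age range are assumptions made for this example; they are not part of MIKE2.0 or of any particular tool.

```python
# Hypothetical reference data (postcode -> state); illustrative only.
POSTCODE_TO_STATE = {"2000": "NSW", "3000": "VIC", "4000": "QLD"}
VALID_STATES = set(POSTCODE_TO_STATE.values())


def correct_record(record):
    """Return (corrected_copy, issues) for one customer record."""
    fixed = dict(record)
    issues = []

    # Gap or value problem: the state is missing or not a recognised code.
    state = (fixed.get("state") or "").strip().upper()
    if state in VALID_STATES:
        fixed["state"] = state
    else:
        looked_up = POSTCODE_TO_STATE.get(fixed.get("postcode"))
        if looked_up:
            fixed["state"] = looked_up  # repaired from reference data
        else:
            issues.append("state: no reference value, needs manual review")

    # Range issue: an implausible age cannot be repaired automatically,
    # so it is blanked and flagged (the 0-120 range is an assumption).
    age = fixed.get("age")
    if age is not None and not (0 <= age <= 120):
        fixed["age"] = None
        issues.append("age: out of range, needs manual review")

    return fixed, issues
```

The returned list of issues reflects the point made above: some problems can be fixed from reference data, while information that has been lost has to be routed to a manual process.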

Correction can also be difficult because data that presents quality issues in one system may not have the same impact in another system, depending on how the data is used. Correction (and other types of data re-engineering) therefore typically follows the “80/20 rule”, using an iterative software development lifecycle until the data has been corrected to the level that provides the most business value.

Data Matching

Data Matching is performed either to consolidate records (de-duplication) or to link records to form new associations. Matching is either exact or probabilistic. The benefit of exact matching is that it provides the highest degree of confidence that the matched records refer to the same entity. The trade-off is high technical programming costs and missed opportunities to match many records. The major benefit of probabilistic matching over exact matching is that it employs fuzzy logic to match fields that are similar, and hence it:
Multiple types of matching capabilities may be employed:
To achieve matching results, individual fields within these record subsets are compared using probabilistic analysis, resulting in an aggregate weight. A threshold is set above which a match is reached: a match is declared when, within the specified parameters, sufficient attributes agree to generate a score above the threshold. Attributes that disagree reduce the aggregate weight. Weightings for each attribute are derived from a frequency analysis of the data population fed into the match. The probabilistic matching process compares the fields specified as matching fields, using comparison algorithms, to determine the best match for a record. Some that are commonly used include:
This comparison provides for phonetic errors, transpositions, and random insertion, deletion and replacement of characters within strings. To determine whether a record is a match, the matching tool calculates a weight for each field comparison according to the probability associated with that field, then sums these weights to obtain a composite weight; the higher the composite weight, the greater the chance of a match. The match comparisons have parameters that can be configured according to the business rules. Records are matched if their match weights exceed the target or cut-off set for the type of match. A higher cut-off indicates a higher degree of confidence that customers are matched. Match weights can vary depending on the size of the files being matched if the tool calculates frequency distributions on match fields. The typical output from a match report contains a frequency and weight analysis for the match, including the matches, duplicates and non-matches for each pass.

Data Enrichment

Data Enrichment involves adding information to data records to fill gaps in the core set of data. The most commonly added enrichment data includes location information such as geocodes, delivery identifiers, customer contact information, personal information such as date of birth or gender codes, demographic information, and external events. Enrichment data can be obtained from the organisation’s own data or from third-party sources. Loading enrichment data follows a typical ETL process and is subject to data quality issues.
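
Returning to the composite-weight approach described under Data Matching, the following Python sketch shows one way the pieces could fit together: fields are standardised, compared with a fuzzy comparator, and the per-field weights are summed and tested against a cut-off. It is a minimal sketch under stated assumptions. The weights, the similarity floor and the cut-off are hypothetical values chosen for the example (a matching tool would derive weights from a frequency analysis), and `difflib.SequenceMatcher` from the Python standard library stands in for whichever comparison algorithm a vendor tool actually uses.

```python
import re
from difflib import SequenceMatcher

# Hypothetical (agree, disagree) weights per field; a matching tool would
# derive these from a frequency analysis rather than hard-code them.
FIELD_WEIGHTS = {
    "name": (6.0, -3.0),
    "address": (4.0, -2.0),
    "postcode": (3.0, -1.5),
}
SIMILARITY_FLOOR = 0.85  # assumed: at or above this ratio, fields "agree"
CUTOFF = 8.0             # assumed cut-off for this type of match

BUSINESS_WORDS = re.compile(r"\b(?:PTY LTD|FAMILY TRUST)\b")
CENTRE_VARIANTS = {"CENTRE", "CNTR", "CENT", "CEN"}


def standardise(value):
    """Upper-case, drop business-structure words, abbreviate CENTRE as CTR.

    Simplification: the structure words are discarded here rather than
    moved to a separate field.
    """
    text = " ".join(value.upper().split())
    text = BUSINESS_WORDS.sub(" ", text)
    tokens = ["CTR" if tok in CENTRE_VARIANTS else tok for tok in text.split()]
    return " ".join(tokens)


def composite_weight(a, b):
    """Sum the agreement or disagreement weight of every matching field."""
    total = 0.0
    for field, (agree, disagree) in FIELD_WEIGHTS.items():
        left, right = standardise(a[field]), standardise(b[field])
        ratio = SequenceMatcher(None, left, right).ratio()
        total += agree if ratio >= SIMILARITY_FLOOR else disagree
    return total


def is_match(a, b, cutoff=CUTOFF):
    """Records match when the composite weight exceeds the cut-off."""
    return composite_weight(a, b) > cutoff


if __name__ == "__main__":
    rec_a = {"name": "Acme Plumbing Pty Ltd",
             "address": "12 Smith St Town Centre", "postcode": "3000"}
    rec_b = {"name": "ACME PLUMBNG",
             "address": "12 Smith St Town Cntr", "postcode": "3000"}
    print(composite_weight(rec_a, rec_b), is_match(rec_a, rec_b))
```

Raising `CUTOFF` trades missed matches for greater confidence in the matches that remain, which is the configuration decision the business rules have to make.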

