16 Sep 2011
Anyone who has ever worked with vast amounts of data has, at one point, had to deal with the issue of matching. In a nutshell, data from one or more systems needs to be fused with data from one or more other systems. Only a neophyte believes that this is an easy task with a data set of any decent size. Newbies ask, “Well, if we have 10,000 employees or products in one application, how hard can it be to link them to 10,000 in another?”
And then IT folks cringe.
In this post, I’d like to offer a primer on matching.
The business case for matching is typically straightforward. At some point and at a high level, organizations need to make sense of their data. To be sure, MDM tools exist that allow organizations to maintain different sets of data in different systems (facilitated by a master list). Yes, this can minimize the need for a complete data cleanup endeavor. But MDM does not eliminate the need for organizations to match their data. In fact, MDM may even enhance that need, forcing organizations to match different sets of data to one “master” list. Here, master records would serve as a bridge, crosswalk, or translate (XLAT) table to different data sets.
For instance, John Smith in an applicant tracking system (ATS) needs to be matched with John Smith in an ERP. We assume that both John Smiths are one and the same but, as we’ll see in this post, that’s not always the case.
Types of Matching
Delving more into the rationale behind matching, Henrik L. Sørensen recently wrote on his well-trafficked blog that “matching party data – names and addresses – is the most common area where Data Matching is practiced.” He goes on to write that different types of matching include:
- Match with external reference data
- Identity Resolution
To continue with our example, five recruiters at Acme, Inc. might have entered John Smith five separate times in the ATS. This is a big no-no, but I’ve met many HR folks who don’t care too much about data quality and the downstream impacts of superfluous transactions. In this case, John Smith needs equal parts deduplication and identity resolution.
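To make the deduplication half of that concrete, here is a minimal sketch in Python. It assumes the five entries differ only in casing, spacing, and SSN punctuation. The field names (`id`, `name`, `ssn`) and the sample rows are made up for illustration and don't come from any real ATS.

```python
import re
from collections import defaultdict

def normalize(rec):
    """Build a comparison key: collapsed, case-folded name plus SSN digits only."""
    name = " ".join(rec["name"].split()).casefold()
    ssn = re.sub(r"\D", "", rec["ssn"])
    return (name, ssn)

# Hypothetical ATS rows: five entries for one person, one for another.
ats_entries = [
    {"id": 1, "name": "John Smith", "ssn": "123-45-6789"},
    {"id": 2, "name": "john smith ", "ssn": "123456789"},
    {"id": 3, "name": "JOHN  SMITH", "ssn": "123-45-6789"},
    {"id": 4, "name": "John Smith", "ssn": "123 45 6789"},
    {"id": 5, "name": "John Smith", "ssn": "123-45-6789"},
    {"id": 6, "name": "Jane Doe", "ssn": "987-65-4321"},
]

# Group record ids by normalized key; any group larger than one is a duplicate set.
groups = defaultdict(list)
for rec in ats_entries:
    groups[normalize(rec)].append(rec["id"])

duplicates = {key: ids for key, ids in groups.items() if len(ids) > 1}
print(duplicates)
```

Normalizing before comparing only catches cosmetic differences. It does nothing for a mistyped digit, which is where the rest of this post comes in.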
Two Approaches: Exact vs Probabilistic Matching
We think that we ultimately need one John Smith throughout Acme’s systems, but how do we get there? If John Smith’s Social Security number (123-45-6789) is the same in both systems, then we can feel pretty confident that it’s the same person and we can remove or consolidate extraneous records. This is an exact match.
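As a rough illustration, an exact match boils down to a lookup on a shared key. The sketch below assumes both systems expose an `ssn` field in the same format. The record layouts and the IDs (`A-1`, `E-9`, and so on) are hypothetical.

```python
# Hypothetical records; field names and IDs are assumptions for illustration.
ats_records = [
    {"ats_id": "A-1", "name": "John Smith", "ssn": "123-45-6789"},
    {"ats_id": "A-2", "name": "Jane Doe", "ssn": "987-65-4321"},
]
erp_records = [
    {"erp_id": "E-9", "name": "John Smith", "ssn": "123-45-6789"},
    {"erp_id": "E-7", "name": "Jon Smyth", "ssn": "555-12-3456"},
]

def exact_match(left, right, key="ssn"):
    """Pair every left record with every right record whose key value is identical."""
    index = {}
    for rec in right:
        index.setdefault(rec[key], []).append(rec)
    return [(l, r) for l in left for r in index.get(l[key], [])]

matches = exact_match(ats_records, erp_records)
for ats, erp in matches:
    print(ats["ats_id"], "<->", erp["erp_id"])
```

Only the pair that shares an identical SSN is linked. Everything else falls through, including records that a human would recognize as near misses.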
But what if we’re missing some data? What if a number is transposed? Let’s say that one record lists his SSN as 123-45-6798. Is this the same John Smith?
Here, exact matching fails us and we have to look at other means, especially if we’re dealing with tens of thousands of records and up. (In the case of one person, we can always ask him!) This is where we should turn to the second type of matching: probabilistic matching.
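To see why the exact comparison breaks down, consider this small sketch. The helper `is_adjacent_transposition` is my own illustrative function, not part of any matching product. It simply checks whether two strings differ by one swapped pair of neighboring characters.

```python
def is_adjacent_transposition(a: str, b: str) -> bool:
    """True if b equals a with exactly one pair of neighboring characters swapped."""
    if len(a) != len(b) or a == b:
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )

ssn_a = "123-45-6789"
ssn_b = "123-45-6798"

print(ssn_a == ssn_b)                           # the exact comparison
print(is_adjacent_transposition(ssn_a, ssn_b))  # the transposition check
```

A check like this flags the pair as a candidate for review. It doesn't tell us how likely it is that the two records describe the same person.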
The best definition I’ve found on the topic comes from Information Management. It defines probabilistic matching as a method that
uses likelihood ratio theory to assign comparison outcomes to the correct, or more likely decision. This method leverages statistical theory and data analysis and, thus, can establish more accurate links between records with more complex typographical errors and error patterns than deterministic systems.
Translation: the law of large numbers is put to use to ascertain relationships that are very likely to exist. The results can be astounding. I’ve heard of tools that use math to produce results that are more than 99.9% accurate.
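The definition above doesn’t say which model is in play. One classic likelihood-ratio formulation is the Fellegi-Sunter approach, and that’s what I’m assuming in the sketch below. For each field, m is the probability that the field agrees when two records truly match, and u is the probability that it agrees by chance when they don’t. The m and u values and the field names are invented for illustration. Real tools estimate them from the data.

```python
import math

# Assumed (m, u) probabilities per field; purely illustrative numbers.
FIELD_PROBS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "birth_date": (0.97, 0.003),
}

def match_weight(agreements: dict) -> float:
    """Sum log2 likelihood ratios: m/u on agreement, (1-m)/(1-u) on disagreement."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PROBS[field]
        ratio = m / u if agrees else (1 - m) / (1 - u)
        total += math.log2(ratio)
    return total

all_agree = match_weight({"last_name": True, "first_name": True, "birth_date": True})
one_disagree = match_weight({"last_name": True, "first_name": True, "birth_date": False})
none_agree = match_weight({"last_name": False, "first_name": False, "birth_date": False})

print(round(all_agree, 2), round(one_disagree, 2), round(none_agree, 2))
```

Higher totals mean stronger evidence of a match. In practice, you would pick two cutoffs: above one, records are linked automatically; below the other, they are treated as different; anything in between goes to a person for review.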
Benefits of Probabilistic Matching
The Mike 2.0 Offering shows that the major benefit of probabilistic matching relative to exact matching is that the former employs fuzzy logic to match fields that are similar. As a result, it:
- Allows matching of records that have transposition or spelling errors and therefore obtains a significant increase in matches over systems using purely string comparisons.
- Standardizes data in free-form fields and across disparate data sources.
- Uncovers information buried in free-form fields and identifies relationships between data values.
- Provides survivorship of the best data within and between source systems.
- Does not require extensive programming to develop matches based on simple business requirements.
Software that matches people on probabilities will link John Smith (123-45-6798) with John Smith (123-45-6789). It’s the same guy.
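Commercial tools do far more than this, but a toy version of the idea can be sketched with Python’s standard `difflib` module. Note that `SequenceMatcher.ratio()` is a plain string-similarity score, not a true probability, and I’m using it here only as a stand-in. The 0.8 threshold and the equal weighting of name and SSN are assumptions, not recommendations.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # assumed cutoff; a real project would tune this against known matches

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()

def likely_same(rec_a: dict, rec_b: dict) -> bool:
    """Average name and SSN similarity, then compare against the threshold."""
    name_score = similarity(rec_a["name"].lower(), rec_b["name"].lower())
    ssn_score = similarity(rec_a["ssn"], rec_b["ssn"])
    return (name_score + ssn_score) / 2 >= THRESHOLD

ats_john = {"name": "John Smith", "ssn": "123-45-6798"}
erp_john = {"name": "John Smith", "ssn": "123-45-6789"}
erp_jane = {"name": "Jane Doe", "ssn": "987-65-4321"}

print(likely_same(ats_john, erp_john))
print(likely_same(ats_john, erp_jane))
```

The two John Smith records clear the threshold despite the transposed digits, while John and Jane do not.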
Of course, the best way to match your data is to manage it properly from the get-go. Even basic ATSs allow organizations to establish business rules that would minimize the likelihood of our two John Smiths. (Prohibiting applicants with the same SSN comes to mind.)
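A business rule of that kind might look something like the following sketch. `ApplicantStore` and `DuplicateApplicantError` are hypothetical names of my own, not the API of any actual ATS. The rule simply refuses a second applicant whose SSN digits are already on file.

```python
class DuplicateApplicantError(ValueError):
    """Raised when an applicant with the same SSN already exists."""

class ApplicantStore:
    def __init__(self):
        self._by_ssn = {}

    def add(self, name: str, ssn: str) -> None:
        # Compare on digits only so "123-45-6789" and "123456789" collide.
        key = "".join(ch for ch in ssn if ch.isdigit())
        if key in self._by_ssn:
            raise DuplicateApplicantError(
                f"SSN already on file for {self._by_ssn[key]!r}"
            )
        self._by_ssn[key] = name

    def __len__(self) -> int:
        return len(self._by_ssn)

store = ApplicantStore()
store.add("John Smith", "123-45-6789")
try:
    store.add("J. Smith", "123456789")
except DuplicateApplicantError as err:
    print("Rejected:", err)
```

Note that a rule like this stops exact duplicates at the door, but a transposed SSN would still sail right through.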
But minimize and eliminate are two different things. End users indifferent to data quality beat systems rife with business rules any day of the week–and twice on Sunday. A culture of data governance is the end-game, but you might need to employ some matching techniques to triage some data issues before you get to the Holy Grail.
What say you?