Why is Information Management so complex?
Tuesday, August 7th, 2007
I was speaking with a client this week who put forward the challenge that Information Management isn’t really as complicated as we in the profession make out. I stopped for a moment to think about how I could explain the intricacy of an entire body of practice and realised that I would need to pick just one example.
Given its prominence in the industry, I decided to use Master Data Management and particularly the process of matching between sets of master data.
I started with just two lists of people (set A and set B). I then explained how a typical algorithm would match individual records by calculating a score for each pair and comparing it with a matching threshold. No problem, my client said: he could use a spreadsheet for that!
I then added a third list (set C). Most algorithms compare two lists at a time. That means there are three possible orders: match A with B and then bring in C, match A with C and then bring in B, or match B with C and then bring in A. To see why it matters, consider the following situation.
In set A, we have a record: “Robert Hillard, email robert.hillard[at]bearingpoint.com”
In set B, we have a record: “Robert Hillard, phone number +61 412 396 036”
In set C, we have a record: “Robert Hillard, phone number +61 412 396 036, email: robert.hillard[at]bearingpoint.com”
A typical business rule might require two items of data to match before the threshold is reached. That means we need name and email, name and phone number or email and phone number to define a match.
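To make the rule concrete, here is a minimal sketch in Python. The field names, the `score` and `is_match` helpers and the use of exact equality are assumptions made for illustration; a real matching engine would normally use weighted, fuzzy comparisons rather than a simple count.

```python
# Hypothetical field names; the three records are the ones listed above.
FIELDS = ("name", "email", "phone")
THRESHOLD = 2  # assumed rule: two agreeing items of data make a match


def score(a, b):
    """Count the fields that are present in both records and equal."""
    return sum(
        1 for f in FIELDS
        if a.get(f) is not None and a.get(f) == b.get(f)
    )


def is_match(a, b):
    """Apply the threshold to the score."""
    return score(a, b) >= THRESHOLD


rec_a = {"name": "Robert Hillard",
         "email": "robert.hillard[at]bearingpoint.com"}
rec_b = {"name": "Robert Hillard",
         "phone": "+61 412 396 036"}
rec_c = {"name": "Robert Hillard",
         "phone": "+61 412 396 036",
         "email": "robert.hillard[at]bearingpoint.com"}

print(is_match(rec_a, rec_b))  # False: only the name agrees
print(is_match(rec_a, rec_c))  # True: name and email agree
print(is_match(rec_b, rec_c))  # True: name and phone number agree
```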
In the first scenario we match A with B first, followed by matching the resulting records with set C. In this example, the two “Robert Hillard” records are not matched in the first pass, because they have only the name in common. On the second pass, when we bring in set C, the two existing entries are each matched against the new Robert Hillard in set C, so at best we end up with two records. The final result is two instances of Robert Hillard.
In the second scenario we match A with C first, which results in a full match on Robert Hillard. When set B is then brought in, the combined record matches the instance in that file as well. The final result is just one instance of Robert Hillard.
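The two scenarios can be replayed with a small Python sketch. The pass logic here is an assumption, not a description of any particular product: each incoming record is merged into the first existing record it matches, otherwise it is added as a new record. The names `match_pass`, `is_match` and `FIELDS` are hypothetical.

```python
FIELDS = ("name", "email", "phone")


def is_match(a, b, threshold=2):
    """Assumed rule: at least `threshold` fields present in both and equal."""
    shared = sum(
        1 for f in FIELDS
        if a.get(f) is not None and a.get(f) == b.get(f)
    )
    return shared >= threshold


def match_pass(base, incoming):
    """Fold `incoming` into `base`, one record at a time.

    Each incoming record is merged into the first record it matches;
    if nothing matches, it is appended as a separate record.
    """
    result = [dict(r) for r in base]
    for rec in incoming:
        for existing in result:
            if is_match(existing, rec):
                existing.update(rec)  # merged record carries fields from both
                break
        else:
            result.append(dict(rec))
    return result


set_a = [{"name": "Robert Hillard",
          "email": "robert.hillard[at]bearingpoint.com"}]
set_b = [{"name": "Robert Hillard",
          "phone": "+61 412 396 036"}]
set_c = [{"name": "Robert Hillard",
          "phone": "+61 412 396 036",
          "email": "robert.hillard[at]bearingpoint.com"}]

ab_then_c = match_pass(match_pass(set_a, set_b), set_c)
ac_then_b = match_pass(match_pass(set_a, set_c), set_b)

print(len(ab_then_c))  # 2
print(len(ac_then_b))  # 1
```

The same three input records produce two people in one order and one person in the other, with no change to the rule or the threshold.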
Now understanding the complexity, my client tried to add a kludge by creating a master record for each match during an individual pass. There isn’t enough space in this posting to explain why this doesn’t help as the number of sets increases. Suffice it to say that each such band-aid solution actually adds to the complexity when more sets are added.
In summary, the more sets there are to match, the more combinations there are which will affect the outcome. If the sets are brought in one at a time, there are n!/2 possible orders for n sets (i.e., n factorial divided by two, which gives the three orders described above when n is 3), and each of them will usually give a different final result for a statistically significant number of entries. Imagine the problem facing the US government when trying to bring together lists of doctors, lawyers or other professionals across 50 state lists!

