27 Dec 2010
Tweety Bird and Aha! Moments
About three months ago, I started a data management and ETL project for a pretty big bank. Time was of the essence and the bank brought me in because I can get results. In this post, I explain why an overemphasis on results can be a really bad thing, and why all matching isn’t created equal.
When my client advised me of the number of disparate extracts of financial data I would receive, I quickly looked for potential commonalities. As most would do on this type of project, I started with the obvious: GL account. While GL account numbers are not unique in most systems I have seen, they are often at least part of a multi-field index. In other words, I can almost always sum debits and credits by GL account 1000, for example, especially when I bring in a field like company.
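
To make the idea of summing by GL account plus company concrete, here is a minimal sketch in Python. The row layout, company codes, and amounts are all made up for illustration; they are not from the bank's extracts.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical extract rows: (company, gl_account, debit, credit).
rows = [
    ("A", "1000", Decimal("150.00"), Decimal("0.00")),
    ("A", "1000", Decimal("0.00"), Decimal("40.00")),
    ("B", "1000", Decimal("75.00"), Decimal("0.00")),
    ("A", "2000", Decimal("0.00"), Decimal("110.00")),
]

def sum_by_company_and_account(rows):
    """Total debits and credits per (company, gl_account) key."""
    totals = defaultdict(lambda: [Decimal("0"), Decimal("0")])
    for company, account, debit, credit in rows:
        key = (company, account)  # account alone is not unique across companies
        totals[key][0] += debit
        totals[key][1] += credit
    return {key: tuple(value) for key, value in totals.items()}

totals = sum_by_company_and_account(rows)
for (company, account), (debits, credits) in sorted(totals.items()):
    print(company, account, debits, credits)
```

Account 1000 shows up under both companies here, which is why the key is the pair rather than the account number on its own.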
Dealing with Suboptimal Data and Other Limitations
Unfortunately, a few of the extracts only contained account descriptions. When I explained this limitation to my boss, we had the following exchange:
Boss: Well, why not join on description instead of account number?
Me: It’s possible, but I never recommend it. In many systems, these descriptions are not standardized. One missing character, misplaced space, or mutant letter will result in missing data downstream and other problems that are worth trying to address from the beginning.
Boss: We have no other choice. IT won’t change the extract from XYZ financial system. Gotta make do….
Me (in Tweety Bird voice): OK…but I got a baaaaaad feeling about this.
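
The concern in that exchange can be sketched in a few lines of Python. Everything below is hypothetical: the account descriptions, account numbers, amounts, and the `normalize` helper are illustrations, not the project's actual data or code. The sketch joins an extract to a chart of accounts on description and shows which rows silently fall out.

```python
def normalize(description):
    """Collapse runs of whitespace and ignore case (a partial patch at best)."""
    return " ".join(description.split()).casefold()

# Hypothetical chart of accounts: description -> GL account number.
chart = {
    "Cash - Operating": "1000",
    "Accounts Receivable": "1200",
    "Accrued Liabilities": "2100",
}

# Hypothetical extract that carries descriptions but no account numbers.
extract = [
    ("Cash - Operating", 500),
    ("Accounts  Receivable", 250),  # misplaced (double) space
    ("Acrued Liabilities", 125),    # missing character
]

def join_on_description(extract, chart, key=lambda s: s):
    """Left join on description; return (matched, unmatched) rows."""
    lookup = {key(desc): account for desc, account in chart.items()}
    matched, unmatched = [], []
    for desc, amount in extract:
        account = lookup.get(key(desc))
        if account is None:
            unmatched.append((desc, amount))
        else:
            matched.append((account, amount))
    return matched, unmatched

exact_matched, exact_unmatched = join_on_description(extract, chart)
norm_matched, norm_unmatched = join_on_description(extract, chart, key=normalize)

print("exact join dropped:", exact_unmatched)
print("normalized join dropped:", norm_unmatched)
```

Under these assumptions, the exact join loses two of the three rows. Normalizing whitespace and case recovers the misplaced space but not the missing character, so the downstream totals are still short.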
Fast forward to the end of the project. Our reports were off, way off: so far off that I had to rejigger a bunch of previously working queries and validation routines. What’s more, this wasn’t a fundamental design issue; the trouble traced back to joining on descriptions. The rework took a few days and caused other problems. We had to break things that previously worked.
Simon Says
Sometimes we don’t have a choice in life and in golf. We hit the ball into the woods, can’t find it, have to take a penalty stroke, and walk away with a snowman on a par 3. It happens.
Although my client wasn’t happy that we had so many problems, he fully understood what I told him on day one: “Try not to join on descriptions.” Of course, the bell didn’t ring quite as loud when I initially said that. After he saw the results first-hand, that bell was pretty loud–and wouldn’t stop.
Understand that there are limitations with certain types of data matching, as many on this blog have pointed out. Some types are much better than others. Some are too restrictive; others are far too liberal. Think about it.
As one of my friends put so nicely, you don’t go to pick up your kid at nursery school and just grab any eight-year-old brunette girl. You probably want your own kid.
Data’s the same way.
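
The restrictive-versus-liberal point above can be illustrated with a small, hedged sketch in Python. The two matching rules and the account descriptions are invented for the example; neither rule is what the project used.

```python
def exact_match(a, b):
    """Too restrictive: any typo breaks the match."""
    return a == b

def first_word_match(a, b):
    """Too liberal: only the first word has to agree."""
    return a.split()[0].casefold() == b.split()[0].casefold()

# Hypothetical description pairs from two systems.
pairs = [
    ("Accounts Receivable", "Accounts Recievable"),  # same account, typo
    ("Accounts Receivable", "Accounts Payable"),     # different accounts
]

results = [
    (a, b, exact_match(a, b), first_word_match(a, b)) for a, b in pairs
]

for a, b, exact, loose in results:
    print(f"{a!r} vs {b!r}: exact={exact}, first_word={loose}")
```

The exact rule rejects the first pair even though it is the same account, while the loose rule accepts the second pair even though the accounts are different. That is the any-brunette-will-do problem in miniature.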
Feedback
What say you?


January 3rd, 2011 at 10:41 am
I’m confused. Were you trying to match accounts by using the descriptive text that describes the account fields?
January 3rd, 2011 at 7:10 pm
Yes. I had no other choice.
January 4th, 2011 at 12:50 am
In BI, sometimes you gotta do what you gotta do, and get by with imperfect data. But it sounds like you recognized early on that this tactic wasn’t just poor practice, it was likely to make the project fail.
In cases like this, you have to put your foot down and insist on data fixes. You can either be the bearer of bad tidings now, or you can be scapegoated later. It looks much worse, three months down the line, to say, “Yes, the tactic failed, but look, I told you it was going to!” You wind up like the mythological Cassandra: correct in your prediction, but ignored and shunned.
Be honest about what you need to succeed, especially if it costs more than what the powers that be expected. You’ll garner more respect in the end if you can actually deliver what’s promised.
January 4th, 2011 at 12:51 am
Fixing the data wasn’t an option, Branden. Ours is an imperfect world. I had to settle for imperfection.
January 4th, 2011 at 2:28 am
I don’t know all the facts but matching on definitions is not only ill advised but practically useless. There are few if any guidelines that are used when people write definitions. I have worked on trying to harmonize data from multiple systems using metadata such as data element names, definitions and structure and once you get into the weeds, you come to realize there are frequently subtle and not so subtle differences in the data. We use semantic analysis to harmonize data and that helps to expose the differences, which are many.
I agree with Branden. I’ve had clients where I had to intervene which in some cases ended up shortening my engagement but I have an aversion to doing things that I feel are ill advised.
I’m not as concerned about being scapegoated or conveying the “I told you so” message as I am about the integrity of my work.
I’m not sure that imperfection is prudent. We may live in a world of imprecision but we should strive for greater accuracy.
I’ve made plenty of mistakes myself but I try to avoid having my clients make mistakes. If they insist however, I have no aversion to telling them to go elsewhere.
January 4th, 2011 at 2:36 am
It’s a great discussion. If I were the client, I would ask that data be cleaned. I explained the downfalls and shortcomings of joining by name but the company wanted to proceed. I voiced my concerns and they noted them.
January 20th, 2011 at 2:30 pm
It is so nice to know that I’m not the only one who has encountered this incredibly frustrating situation. You did the best you could do with what you were given. It sounds like the client needed things to be resolved fast and they clearly underestimated your warning.
As you put it: “Of course, the bell didn’t ring quite as loud when I initially said that. After he saw the results first-hand, that bell was pretty loud–and wouldn’t stop.”
This is the kind of wakeup call that will actually get clients to realize the importance of standardizing data entry. I’m pretty confident that if you check back with them in a few months you will find that they have addressed that issue.
January 20th, 2011 at 6:40 pm
Natalie
It’s one thing when I say something “in the abstract.” It’s another when they actually get their arms around it. So, yes, I agree with you: speed was imperative at first, not “doing it the right way.”