“When the amount of data in the world increases at an exponential rate, analyzing that data and producing intelligence from it becomes very important,” says Anand Rajaraman, senior vice-president of global e-commerce at Wal-Mart and head of @WalmartLabs, the retailer’s division charged with improving its use of the Web.
Today, more than ever, intelligent businesses are trying to make sense of millions of tweets, blog posts, comments, reviews, and other forms of unstructured data. The obvious question becomes, “How?”
which customers buy (and like) other apps based on the purchase of the first app
which customers are more likely to consider buying apps in the same category
Few businesses have the same level of knowledge about their customers. Apple is the exception that proves the rule. In other words, rare is the organization with access to detailed data on millions of its customers, structured or otherwise. What’s a “normal” company to do? Is there nothing they can do but watch from the sidelines?
In a word, no. These emerging applications show tremendous promise.
Now, I won’t pretend to have intimate knowledge of each of these data-mining applications and projects. At a high level, they are designed to help large organizations interpret vast amounts of data. Clearly, developers out there have recognized the need for such applications and have built them according to what they think the market wants.
One application equipped to potentially make sense of all of this unstructured data is Hadoop, an open source development project. From its website, “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.” Certainly worth looking into.
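Hadoop’s “simple programming model” is MapReduce: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. Here is a minimal sketch of that idea in plain Python; it mimics the model locally for illustration and is not Hadoop’s actual API.

```python
from collections import defaultdict

# A toy illustration of the MapReduce model that Hadoop distributes
# across a cluster; here map, shuffle, and reduce all run locally.

def map_phase(document):
    # Emit (word, 1) pairs, as a word-count mapper would
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Group values by key, as Hadoop's shuffle/sort does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data is big", "data about data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["data"])  # prints 3: one from the first document, two from the second
```

The same three functions, distributed over a cluster and fed terabytes instead of two strings, are essentially what the Hadoop framework manages for you.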
The Benefits of Free
Imagine for a moment that you’re a mid-level manager in a large organization. You see the need for a tool that would help you mine data and attempt to find hidden patterns–and ultimately knowledge. Excel just doesn’t cut it. You’d love to see what Hadoop or one of its equivalents can do. Yet, you’re not about to fall on the sword for an unproven product in tight economic times.
Free and open source tools are certainly worth considering. Download Hadoop or another application and begin playing with it. Network with people online and off. Ask them questions. Noodle with different data sets and see if you learn anything about your company, its customers, and underlying trends and drivers. Worst case scenario: you waste a little time.
In any organization, the traditional RFP process can be extremely cumbersome. Times are tight, and it’s entirely possible that even potentially valuable projects like mining unstructured data may not get the go-ahead. What’s more, your organization may have established relationships with companies like IBM that offer proprietary applications and services in the BI space. And, to be sure, Hadoop and other open source/free tools may not meet all of your organization’s needs.
All of this is to say that open source software is no panacea, and this case is no exception. However, doesn’t it behoove you to see what’s out there before making the case for a large expenditure–one that may ultimately not succeed? Is there any real harm in downloading a piece of software just to see what it can do?
Since the 1980s, the costs of shared functions have increasingly been allocated to business units in ways that drive accountability.
For information technology this was relatively easy in the late 1980s, when the majority of costs were tied to infrastructure and processing; typically, the cost of the mainframe was allocated based on usage. Through the 1990s, costs moved increasingly to a project focus, with a model that encouraged good project governance and the allocation of infrastructure based on the functions delivered.
Arguably, the unfortunate side effect of the allocation of project costs has been that many business units see information technology as being unnecessarily expensive – whereas many of the costs really just reflect the sheer complexity of business technology (see my previous post: CIOs need to measure the right things). Such an approach to cost allocation has allowed business units to execute projects of increasing sophistication; however it may not be ensuring that information technology is being used in the way that will achieve the greatest possible strategic impact.
The other problem with the project-focused approach to cost recovery is that the CIO’s role is diminished to being that of a service provider. In some organisations this has gone so far as to result in the CIO seeing external service providers as competitors.
Refocusing the cost recovery to the value achieved has the potential to deal the CIO back into the strategic debate. As I’ve said before, information technology is extremely complex and requires experience and insight in order to identify the real opportunities to use it effectively to support and differentiate any business. During the next decade, we are likely to see the continued blurring of the lines between internal business technology, joint ventures and the products that consumers use. For instance, joint venture partners expect to see detailed financial reports across boundaries and consumers are used to helping themselves through the same interfaces that were previously restricted to call centre operators.
Recently on the web there has been some discussion on whether information should be valued. At the same time there has been good progress in the development of techniques to value information assets (for instance, see the MIKE2.0 article on Information Economics). The value of information is a very good way of predicting likely business value, even when the way that value will be realised has not yet been determined. The disruption to previously stable businesses, such as retail, telecommunications and manufacturing, is a very good example of why it is important to understand value beyond the horizon of current revenue models.
Allocating at least some of the cost of an effective information technology department based on value focuses the budget debate on the development of revenue earning products that will leverage these new capabilities. It also ensures that those units receiving the greatest potential value are motivated to realise it. Finally, the move away from a largely activity-based approach to measuring cost reduces the tendency to continually cut project scopes to keep within a defined budget.
In my last post, I discussed matching at a relatively high level. I concluded the post with a reference to software that uses some pretty sophisticated math to match based upon probabilities. Netrics was one such company and was recently acquired by Tibco.
In this post, I’d like to discuss what to do when you’re under the gun. That is, your organization is in the midst of some type of data cleanup project on a burning platform. Maybe your consultants didn’t tell you about the difficulties to expect when migrating data. In any event, you don’t have the time to evaluate different matching applications, talk to different vendors, test the application, and the like. You need to solve a problem now, if not sooner.
Note that this post assumes that matching by simple inner joins is not possible, an example of which is shown below. That is, there’s no common record among the tables that would allow for easy consolidation and matching.
In the figure above, tbl_Student.StudentID should tie directly to tbl_EligibilityStatus.EligibilityStatusID. A record in one table should match a record in the other (the operative word here being should).
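For the easy case ruled out above, a simple inner join does the job. Here is a minimal sketch using SQLite with the post’s table names; the column definitions and sample data are assumptions for illustration.

```python
import sqlite3

# The "easy case": matching via a simple inner join on a shared key.
# Table names come from the post; columns and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_Student (StudentID INTEGER, Name TEXT);
CREATE TABLE tbl_EligibilityStatus (EligibilityStatusID INTEGER, Status TEXT);
INSERT INTO tbl_Student VALUES (1, 'John Smith'), (2, 'Jane Doe');
INSERT INTO tbl_EligibilityStatus VALUES (1, 'Eligible'), (3, 'Pending');
""")
rows = conn.execute("""
    SELECT s.Name, e.Status
    FROM tbl_Student s
    INNER JOIN tbl_EligibilityStatus e
        ON s.StudentID = e.EligibilityStatusID
""").fetchall()
print(rows)  # only StudentID 1 has a counterpart in both tables
```

When that `ON` clause has nothing reliable to join on, you are in the territory the rest of this post covers.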
So, what’s a data management professional to do when employee number, product ID, or customer number isn’t consistently populated across our different systems and tables? While hardly as powerful or accurate as the applications dedicated to solving these problems, manual techniques can still manipulate data in ways that increase the probability of your matches. In the past, I’ve used SQL statements and other tricks as the basis for my data matching.
Consider the following table:
Now, if social security number (SSN) were consistently defined and populated throughout our tables, systems and applications, matching would be a piece of cake. Lamentably, this is often not the case. As a result, to match, we have to get creative.
I could concatenate different fields as follows, creating what appears to be a unique value for each record (Concat):
(figure: sample records and their concatenated Concat values, including addresses 100 Main Street and 74 Rush Street)
The problem with this type of concatenation, however, is that it assumes that every table, system, and application has all of these fields. This may be a faulty assumption. For instance, what if an applicant tracking system (ATS) does not include an employee’s job code? That’s certainly possible because the employee hasn’t been hired yet!
So, let’s revise our attempt to create a primary key to remove job code:
(figure: the revised concatenated values without job code, for the same sample records)
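A rough sketch of how such a concatenated key might be built; the field names and normalization steps here are illustrative assumptions, not any particular schema.

```python
# Build a surrogate "Concat" key from fields most systems share
# (last name, first name, SSN, street address). Field names are
# illustrative.

def concat_key(record):
    # Trim whitespace and normalize case so trivial differences
    # between systems don't break the match
    parts = (record["last"], record["first"], record["ssn"], record["street"])
    return "|".join(p.strip().upper() for p in parts)

hr_row  = {"last": "Smith", "first": "John",  "ssn": "123-45-6789",
           "street": "100 Main Street"}
ats_row = {"last": "smith", "first": " John", "ssn": "123-45-6789",
           "street": "100 main street"}

print(concat_key(hr_row) == concat_key(ats_row))  # True once normalized
```

The pipe delimiter is a small safeguard: without it, concatenated values from different fields can collide (e.g., "AB" + "C" versus "A" + "BC").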
This seems like a more reasonable approach, as many systems contain at least this basic data on employees. (Whether it’s the correct address, name, SSN, etc. is another matter.) John Smith might be listed as follows:
John  Smith (note the extra space between the first and last names, something that can be eliminated with TRIM statements)
John P. Smith
I could go on, but you get my drift. This approach to data matching is by no means guaranteed. It’s at least a starting point and, depending on the data at my disposal, I’ve used it to obtain accuracy levels of approximately 90 percent. So, if I have 1,000 records that need to be matched, at least I’ve given my end users only 100 to look at and individually verify.
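One cheap way to catch near-variants like the John Smiths above is to normalize whitespace (the TRIM step mentioned earlier) and then score string similarity. Here is a sketch using Python’s standard-library difflib; the 0.8 threshold is an arbitrary choice you would tune against your own data.

```python
from difflib import SequenceMatcher

# Normalize, then score similarity to catch near-matches such as
# "John  Smith" vs "John P. Smith". The 0.8 cutoff is illustrative.

def normalize(name):
    # Collapse runs of whitespace and lowercase (the TRIM step)
    return " ".join(name.split()).lower()

def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(similar("John  Smith", "John Smith"))    # True: identical after normalizing
print(similar("John P. Smith", "John Smith"))  # True: ratio is about 0.87
print(similar("John Smith", "Mary Jones"))     # False
```

Any pair scoring below the threshold goes to the manual-review pile, which is exactly the 90/10 split described above.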
One caution with this method: You may wind up with a few false positives–i.e., records that the above method claims match but in fact do not.
Sometimes software isn’t an option and there are far too many records to match manually. In these instances, we have to improvise. Use some of the techniques described in this post to expedite your matching with a fair degree of accuracy. Check a few of the matches to ensure that they make sense. While not perfect, methods like these can save time and money.
Rick Sherman is the founder of Athena IT Solutions which focuses on maximizing ROI from business intelligence. He is an experienced consultant in areas such as data warehouse and business intelligence assessments and implementation, onsite data warehouse training, and often helps industry vendors with product assessments and marketing support.
Anyone who has ever worked with vast amounts of data has, at one point, had to deal with the issue of matching. In a nutshell, data from one system or system(s) needs to be fused with data from another system or system(s). Only a neophyte believes that this is an easy task with a data set of any decent size. Newbies ask, “Well, if we have 10,000 employees or products in one application, how hard can it be to link them to 10,000 in another?”
And then IT folks cringe.
In this post, I’d like to offer a primer on matching.
The business case for matching is typically straightforward. At some point and at a high level, organizations need to make sense of their data. To be sure, MDM tools exist that allow organizations to maintain different sets of data in different systems (facilitated by a master list). Yes, this can minimize the need for a complete data cleanup endeavor. But MDM does not eliminate the need for organizations to match their data. In fact, MDM may even enhance that need, forcing organizations to match different sets of data to one “master” list. Here, master records would serve as a bridge, crosswalk, or translate (XLAT) table to different data sets.
For instance, John Smith in an applicant tracking system (ATS) needs to be matched with John Smith in an ERP. We assume that both John Smiths are one and the same but, as we’ll see in this post, that’s not always the case.
Types of Matching
Delving more into the rationale behind matching, Henrik L. Sørensen recently wrote on his well-trafficked blog that “matching party data – names and addresses – is the most common area where Data Matching is practiced.” He goes on to write that different types of matching include:
Match with external reference data
To continue with our example, five recruiters at Acme, Inc. might have entered John Smith five separate times in the ATS. This is a big no-no, but I’ve met many HR folks who don’t care too much about data quality and the downstream impacts of superfluous transactions. In this case, John Smith needs equal parts deduplication and identity resolution.
Two Approaches: Exact vs Probabilistic Matching
We think that we ultimately need one John Smith throughout Acme’s systems, but how do we get there? If John Smith’s social security number (123-45-6789) is the same in both systems, then we can feel pretty confident that it’s the same person and we can remove or consolidate extraneous records. This is an exact match.
But what if we’re missing some data? What if a number is transposed? Let’s say that one record lists his SSN as 123-45-6798. Is this the same John Smith?
Here, exact matching fails us and we have to look at other means, especially if we’re dealing with tens of thousands of records and up. (In the case of one person, we can always ask him!) Here, we should turn to the second type of matching: probabilistic matching.
The best definition I’ve found on the topic comes from Information Management. It defines probabilistic matching as a method that
uses likelihood ratio theory to assign comparison outcomes to the correct, or more likely decision. This method leverages statistical theory and data analysis and, thus, can establish more accurate links between records with more complex typographical errors and error patterns than deterministic systems.
Translation: the law of large numbers is put to use to ascertain relationships that are very likely to exist. The results can be astounding. I’ve heard of tools that use math to produce results that are more than 99.9% accurate.
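A crude sketch of the idea: score several fields, weight them, and accept the pair when the combined score clears a threshold. Real probabilistic engines use likelihood-ratio theory as the quoted definition says; the string-similarity scoring, weights, and cutoff below are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Toy probabilistic-style matcher: weighted per-field similarity scores.
# Weights and the 0.9 cutoff are illustrative, not from any real tool.

WEIGHTS = {"name": 0.4, "ssn": 0.6}

def field_score(a, b):
    return SequenceMatcher(None, a, b).ratio()

def match_probability(rec_a, rec_b, weights=WEIGHTS):
    total = sum(weights.values())
    return sum(w * field_score(rec_a[f], rec_b[f])
               for f, w in weights.items()) / total

a = {"name": "John Smith", "ssn": "123-45-6789"}
b = {"name": "John Smith", "ssn": "123-45-6798"}  # two digits transposed

score = match_probability(a, b)
print(score > 0.9)  # True: high score despite the transposition
```

An exact match on SSN would have rejected this pair outright; the weighted score survives the transposed digits, which is exactly the benefit being claimed.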
Benefits of Probabilistic Matching
The MIKE2.0 offering shows that the major benefit of probabilistic matching relative to exact matching is that the former employs fuzzy logic to match fields that are similar. As a result, it:
Allows matching of records that have transposition or spelling errors and therefore obtain a significant increase in matches over systems using purely string comparisons.
Standardizes data in free-form fields and across disparate data sources.
Uncovers information buried in free-form fields and identifies relationships between data values.
Provides survivorship of the best data within and between source systems.
Does not require extensive programming to develop matches based on simple business requirements.
Software that matches people on probabilities will link John Smith (123-45-6798) with John Smith (123-45-6789). It’s the same guy.
Of course, the best way to match your data is to manage it properly from the get-go. Even basic ATSs allow organizations to establish business rules that would minimize the likelihood of our two John Smiths. (Prohibiting applicants with the same SSN comes to mind.)
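That kind of business rule can also be enforced at the database level. Here is a minimal sketch with SQLite, where a UNIQUE constraint blocks the second John Smith at the point of entry; the schema is illustrative.

```python
import sqlite3

# Enforce "no duplicate SSNs" with a UNIQUE constraint so the rule
# cannot be bypassed by an indifferent end user. Schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE applicants (name TEXT, ssn TEXT UNIQUE)")
conn.execute("INSERT INTO applicants VALUES ('John Smith', '123-45-6789')")
try:
    conn.execute("INSERT INTO applicants VALUES ('John Smith', '123-45-6789')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False: the constraint rejected the duplicate
```

Note that this only catches exact SSN collisions; a transposed digit sails straight through, which is why the matching techniques above remain necessary.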
But minimize and eliminate are two different things. End users indifferent to data quality beat systems rife with business rules any day of the week–and twice on Sunday. A culture of data governance is the end-game, but you might need to employ some matching techniques to triage some data issues before you get to the Holy Grail.
Yogi Berra famously once said, “When you come to a fork in the road, take it.” In this post, I’ll discuss a few different paths related to data quality during data migrations.
But let’s take a step back first. Extract-Transform-Load (ETL) represents a key process in any information management framework. At some point, at least some of an organization’s data will need to be taken from one system, data warehouse, or area, transformed or converted, and loaded into another data area.
The entire MIKE2.0 framework in large part hinges on data quality. DQ represents one of the methodology’s key offerings, if not its most important one. To this end, it’s hardly unique as an information management framework. But, as Gordon Hamilton and my friend Jim Harris pointed out recently on an episode of OCDQ Radio, not everyone is on the same page when it comes to when we should actually clean our data. Hamilton talks about EQTL (Extract-Quality-Transform-Load), a process in which data quality is improved in conjunction with an application’s business rules. (Note that there are those who believe that data should only be cleaned after it has been loaded into its ultimate destination–i.e., that ETL should give way to ELT.)
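The difference between these orderings is easiest to see as code. Here is a toy sketch with each phase as a plain function; all names and data are illustrative.

```python
# Toy contrast of where the quality ("Q") step sits in EQTL versus ELT.
# Each function stands in for a real pipeline stage.

def extract():
    return [" john smith ", "JANE DOE", "jane doe"]  # messy source rows

def cleanse(rows):
    # the quality step: trim, normalize case, deduplicate
    return sorted({r.strip().title() for r in rows})

def transform(rows):
    return [r.upper() for r in rows]  # stand-in for business-rule transforms

def load(rows):
    return rows  # stand-in for writing to the target system

# EQTL: quality runs between extract and transform
eqtl_result = load(transform(cleanse(extract())))

# ELT (with quality last): data lands in the target first,
# and is cleansed only once it is there
elt_result = cleanse(load(transform(extract())))

print(eqtl_result)  # two clean, deduplicated rows
```

Either ordering ends with clean data here; the practical difference is whether the dirty rows ever touch the target system, which is precisely the debate described above.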
Why does this matter? Many reasons, but I’d argue that implicit in any decent information management framework is the notion of data quality. While many frameworks, models, and methodologies vary, it’s hard to imagine any one worth its salt that pooh-poohs accurate information. (For more on a different framework, see Robert Hillard’s recent post.)
Data Quality Questions
Different frameworks aside, consider the following questions:
As data management professionals, should we be concerned about the quality of the data as we are transforming and loading it?
Or can our concerns be suspended during this critical process?
And the big question: Is DQ always important?
I would argue yes, although others may disagree with me. So, for the organization migrating data, when is the best time to conduct the cleansing process? Here are the usual suspects:
Organizations can often save an enormous amount of time and money by cleansing their data before moving it from point A to point B. To be sure, different applications call for different business rules, fields, and values. Regardless, it’s a safe bet that a new system will require different information than its predecessor. Why not get it right before mucking around in the new application?
Some contend that this is generally the ideal point to cleanse data. Retiring a legacy system typically means that organizations have decided to move on. (The term legacy system implies that it is no longer able to meet a business’ needs.) As such, why spend the time, money, and resources “fixing” the old data? In fact, I’ve seen many organizations specifically refuse to cleanse legacy data because their end users felt more secure with the knowledge that they could retrieve the old data if need be. So much for cutting the cord.
If your organization takes this approach, then it is left with two options: cleanse the data during or post-migration. If given the choice, I’d almost always opt for the former. It’s just easier to manipulate data in Excel, Access, ACL, or another tool than it is in a new system loaded with business rules.
The ease of manipulating data outside of a system is the very reason that many organizations prefer to clean their data in a system of record. Excel doesn’t provide a sufficient audit trail for many folks concerned about lost, compromised, or duplicated data. As such, it makes more sense to clean data in the new system. Many applications support this by allowing for text-based fields that let users indicate the source of the change. For instance, you might see a conversion program with a field populated with “data change – conversion.”
Is this more time-consuming? Yes, but it provides greater audit capability.
John Morris is an experienced Data Migration professional with a 25 year history in information technology.
For the last ten years he has specialised solely in delivering Data Migration projects. During that time he has worked on some of the biggest migration projects in the UK and for some of the major systems integrators.
John is the author of “Practical Data Migration” the top selling book on data migration published by the BCS (British Computing Society).
More and more organizations are creating formal positions for the Chief Data Officer (CDO)–aka, the Data Czar. In this post, I’ll discuss the role and some of the challenges associated with it.
Background: The What and the How
The notion of a CDO is not exactly new. In fact, credit card and financial services company Capital One hired one back in 2003. (Disclaimer: I used to work for the company).
At a high level, the CDO is supposed to play a key executive leadership role in driving data strategy, architecture, and governance. S/he is the executive leader for data management activities. So, we know what the CDO is supposed to do, but how exactly is s/he supposed to accomplish these often delicate tasks? It’s not a simple answer.
…there is not yet a script for implementing the CDO. Organizations need to experiment to figure out how it will work best. I think this was and is the most important observation. And, very likely different organizations will adopt different directions based on the roles data play, to satisfy customers, conduct operations, make decisions, innovate, and so forth. One would expect different strategies to lead to different organizational forms.
Redman is right in pointing out the myriad waters that a CDO has to navigate. The CDO is by no means a simple job and touches many parts of the organization.
Who and Why?
Rooted in this question of “What does the CDO do?” are two related queries:
To whom does the CDO report?
Why is this person here?
In fact, some people openly question the basic need for a CDO. By way of contrast, an organization’s CFO or Chief People Officer (CPO) occupy more established roles, even if the latter’s title has changed. Does anyone really doubt the need for a head of HR and/or finance?
Few organizations place the CDO on par with the CTO or CIO. It’s not uncommon for the CDO to report to one of the other “C’s.” To this end, should one “C” report to another? Is this just title inflation?
I can see the need for an effective CDO at a large, multi-national organization lacking discipline with respect to its data management practices. You could certainly make the case for a Data Czar when an organization’s culture is miles away from meaningful data governance.
But it’s hard for me to imagine the CDO as a must-have role for every type of organization. Even a mid-market company might be hard-pressed to justify the expense. A well-run conglomerate can probably get by without one.
To be sure, the CDO has the ability (responsibility?) to look at things from a much more global perspective than busy line of business (LOB) users stuck in the trenches. And, at least to me, this is part of the fundamental challenge for the CDO: How do they manage the IT-business chasm? End users create, modify, and purge data. Data shouldn’t be the responsibility of “the Data Department” or even IT.
CDOs certainly have their work cut out for them. They probably only exist in companies facing very complicated and significant data challenges. As such, they must strike the appropriate balance between:
solving immediate data-related problems; and
understanding high-level data and information management issues
By the same token, though, CDOs cannot be seen as the police if they want to be effective. People tend to dread calls and meetings with those critical of their practices–with no understanding of the business issues at play.
Gartner is reporting today that the social customer relationship management (CRM) market is forecast to reach over $1 billion in revenue by year-end 2012. That’s a surprising figure considering it accounts for less than 5 percent of the total CRM application market. Based on these and other growing trends in CRM, many IT professionals have begun to comment on the considerations involved in the strategic decision between hosted and in-house CRM.
Hosted CRM describes the approach where a company outsources its CRM systems and functions: the software and IT support are outsourced, with subscription costs covering maintenance, support, upgrades, and training. In-house CRM, on the other hand, has been the model traditionally adopted by larger organizations, with a tailored on-premise implementation. The hosted approach is not limited to CRM and is referred to as “Software as a Service” (SaaS) by the wider IT community.
In considering approaches to CRM, organisations are looking to best align their people and systems with operational and strategic objectives including a focus on improving customer service, revenue uplift and flexibility with speed to market.
The ROI comparison between subscription and implementation costs (along with related delivery and operational costs) matters a great deal, but the fundamentals that drive business value, namely a consistent view of the customer across the organization and its systems, are an equally important aspect of the business case.
Questions are often asked about the functionality and out-of-the-box features a CRM system supports; however, it is the ability of the CRM to integrate into an organization’s overall Enterprise Data Management strategy and solution that will underpin the achievement of these objectives.
Organisations are asking “what are the important considerations with Return on Investment (ROI) with out-of-the-box (OOTB) hosted solutions compared to in house solution implementation?” For one, the “investment” aspect is notably cheaper with hosted solutions. You can save up to 80% by hosting, according to the Aberdeen Group, a Boston-based consulting firm.
Aside from that point, some of the key considerations and questions that should be discussed alongside the common OOTB and ROI aspects are:
What are your data integration requirements? Do you need to take data from multiple sources and can you move between the “in-house” and “hosted” solutions with ease?
How important is Business Intelligence (BI) and does the CRM implementation support business decision making tasks by feeding into shared Information Repositories?
Is your organization governed by regulations, and does your CRM system need to conform to these regulations?
Although this is by no means an exhaustive list of considerations, it illustrates that, in parallel with a comparison of the vertical solutions vendors provide (hosted and in-house), there is a need for alignment with an overall information management approach.
MIKE2.0 takes a holistic view of information management, an approach that can be leveraged to assist in strategic planning and CRM solution selection without compromising the overall strategy currently in place. We invite you to join us on this initiative.