Open Framework, Information Management Strategy & Collaborative Governance | Data & Social Methodology - MIKE2.0 Methodology

Archive for the ‘Master Data Management’ Category

by: Philsimon
02  Aug  2010

Charlie Rose, Customer Service, and the Master Twitter Record

My post last week detailed the customer service-oriented frustrations that many of us face while dealing with large companies. Writing about it was cathartic and probably saved me at least one trip to a shrink.

In a related and inspired post, Maria Ogneva writes about a future world in which companies and customers are able to efficiently interact via different media:

  • phone
  • email
  • Twitter handle
  • physical address (required to be sure, although it just seems so dated)

    It’s an interesting post and one well worth checking out.

    All of this makes me wonder a few things:

    • Do many large organizations even have a master customer record anywhere?
    • Which people and departments can access this record? What about updating it?
    • What specific data elements are on that record?
    • Since Twitter handle is probably not one of those elements, from a technical standpoint, how hard would it be to add? Adding it is sure preferable to the separate creation of a master Twitter Record (MTR) that would invariably get out of sync with a “main” one.
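To make the last question concrete, here is a minimal sketch of a master customer record that stores contact channels in one map. The class, field names and sample values are all hypothetical and not taken from any real system; the point is only that a Twitter handle can become one more entry on the existing record rather than a separate MTR.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MasterCustomerRecord:
    """Hypothetical master customer record; all field names are illustrative."""
    customer_id: str
    name: str
    # One map of channel -> address keeps every contact point on the same record.
    contact_channels: Dict[str, str] = field(default_factory=dict)

    def set_channel(self, channel: str, value: str) -> None:
        """Add or update a contact channel such as 'email' or 'twitter'."""
        self.contact_channels[channel] = value

    def get_channel(self, channel: str) -> Optional[str]:
        """Return the stored value for a channel, or None if it is absent."""
        return self.contact_channels.get(channel)


record = MasterCustomerRecord(customer_id="C-001", name="Jane Doe")
record.set_channel("phone", "555-0100")
record.set_channel("email", "jane@example.com")
# Adding a Twitter handle is just another entry, not a separate "MTR".
record.set_channel("twitter", "@janedoe")
```

In a real organization the hard part is usually not the schema change but deciding which systems and departments may read and update the new element.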

      Now, no one here is claiming that Twitter ought to be the primary means for a company to deal with its customers. First, not everyone is on Twitter. Second, that aside for a moment, a direct message of no more than 140 characters is probably too restrictive to resolve an even moderately complex customer issue. Finally, it’s so easy to retweet that companies would (probably justifiably) fear making certain responses public–at least so easily.

      On the other hand, why not have that club in the bag? Use it when needed. This would be like my rarely used three iron on the golf course. I don’t use it often, but when I need it, I’m sure glad that I have it.

      Big Company Customer Service

      I don’t buy into the notion that a large organization cannot provide state-of-the-art customer service. It’s a matter of priorities and will. There’s no business or technology limitation to being able to take care of the people who take care of you.

      Case in point: Amazon.com.

      Recently on the Charlie Rose show, Amazon CEO Jeff Bezos talked about the company’s relentless focus on the customer from day one. There were those early on who claimed that Amazon was a “cute little company” but, once the heavyweights like Wal-Mart embraced e-commerce, they would crush the Amazons of the world. Bezos laughs now about the “Amazon.toast” references from the mid-1990s.

      Reports of Amazon’s demise were premature. Today, to say that the company does a good job managing customer information to create a seamless buying experience is an understatement. Also, consider that Amazon isn’t really “just Amazon”: many small companies maintain stores on the Amazon site. To buy a book or a pair of shoes from one of them, customers need not reenter their information multiple times.

      With that in mind, is it really too much to ask a big telecom company to get its act together?

      Simon Says

      Look, when a company of any size makes a customer service error, the aggrieved customer is going to tell people about it. Lots of people. The company cannot control anything that the customer says or does after that point. This is a far cry, though, from claiming that the company is helpless against a constant stream of negative feedback via tweets, emails, blog posts, discussion boards, and the like. Use a mistake as an opportunity to rectify a bad situation. Big mouths and active fingers work both ways: Those same customers are likely to point out that at least the company did the right thing in the end–and doing the right thing requires good data on those customers.

      Feedback

      What say you?

      Category: Information Value, Master Data Management
      2 Comments »

      by: Philsimon
      07  Jun  2010

      Are We Seeing the Death of the Freemium Model?

      This past April, popular social networking site Ning announced that it would no longer be able to offer its services for free. In an e-mail to a workforce reduced by 40 percent, Ning CEO Jason Rosenthal wrote:

      “Our premium Ning networks like Friends or Enemies, Linkin Park, Shred or Die, Pickens Plan, and tens of thousands of others … drive 75 percent of our monthly U.S. traffic, and those network creators need and will pay for many more services and features from us.”

      It shouldn’t be surprising that Rosenthal’s tone was rife with hope. But what if some or even most of Ning’s networks do not opt to pay for previously free services? I personally have been sent emails from soon-to-be-former Ning networks about their plans to move to a different platform rather than pony up.

      Dissecting the Ning Decision

      Those unfamiliar with Ning might think that the company is the brainchild of a few crazy kids without a great deal of business acumen. Think Chat Roulette. That’s hardly the case. One of the company’s primary visionaries and investors is Marc Andreessen, a man who has made billions from successful technology-based ventures.

      Perhaps you’re thinking that Ning never gained any traction? Wrong again. At the time of the announcement, the company’s Alexa rank was 126 and the number of Ning networks in existence was in the hundreds of thousands. Many popular Ning networks had tens of thousands of users, putting the company’s reach easily into the millions. The bottom line is that Ning could not sustain the Freemium model outlined in Chris Anderson’s popular book Free: The Future of a Radical Price.

      The History of Freemium and a Possible Domino Effect?

      For those of you not familiar with the Freemium model, it boils down to this definition from Wikipedia:

      “Give your service away for free, possibly ad supported but maybe not, acquire a lot of customers very efficiently through word of mouth, referral networks, organic search marketing, etc., then offer premium priced value added services or an enhanced version of your service to your customer base.”

      Now, if Ning were one of few companies attempting to grow its business via Freemium, then it could be dismissed as an aberration. It can’t. The model is pervasive. In fact, most firms these days receiving venture capital (VC) funding operate under some type of Freemium model.

      Consider the fact that open source (OS) software companies are utilizing Freemium. For example, in April of 2010, OS data solutions company Talend received an additional $8M in VC funding. Talend allows anyone to download its software for free and use many of its bells and whistles. To unlock advanced features, however, clients have to pay.

      What if the vast majority of Talend clients decide that 70 percent of a product’s functionality for free trumps all functionality with a bill? Would the Talend business model crumble? Based on what happened to Ning, will VCs ultimately become skeptical of the Freemium model and refuse to fund companies that rely upon it? As David Heinemeier Hansson wrote in a post on 37signals.com, “Eyeballs still don’t pay the bills.”

      Feedback

      This raises the question: Is the Freemium model ultimately sustainable? Remember that open source does not mean free. As Heather Meeker wrote in my second book about open source and the nature of free, “Think free speech, not free beer.”

      What do you think?

      Category: Master Data Management, Open Source, Web2.0
      8 Comments »

      by: Larry.dubov
      14  Aug  2009

      Quantifying Data Quality with Information Theory

      Information Theory Approach to Data Quality for MDM

      Introduction

      Over the past decade data quality has been a major focus for data management professionals, data governance organizations, and other data quality stakeholders across the enterprise. Still, the quality of data remains low for many organizations. To a considerable extent this is caused by a lack of scientifically or at least consistently defined data quality metrics. Data professionals still lack a common methodology that would enable them to measure data quality objectively in terms of scientifically defined metrics and compare data sets in terms of their quality across systems, departments and corporations.

       

      Even though many data profiling metrics exist, their usage is not scientifically justified. Consequently enterprises and their departments apply their own standards or apply no standards at all.

       

      As a result, regulatory agencies, executive management and data governance organizations lack a standard, objective and scientifically defined way to articulate data quality requirements and measure data quality improvement progress. Because data quality remains elusive, the job performance of the enterprise roles responsible for it cannot be judged against consistently defined criteria, which ultimately limits progress in data quality improvement.

       

      A quantitative approach to data quality, if developed and adopted by the data management community, would enable data professionals to better prioritize data quality issues and take corrective actions proactively and efficiently.

       

      In this article we will discuss a scientific approach to data quality for MDM based on Information Theory. This approach seems to be a good candidate to address the aforementioned problem.

       

      Approaches to Data Quality

      At a high level there are two well-known and broadly used approaches to data quality. Typically both of them are used to a certain degree by every enterprise.

       

      The first approach is mostly application driven and often referred to as a “fit-for-purpose” approach. Business users frequently determine that certain application queries or reports do not return the right data. For instance, if a query that is supposed to fetch the top 10 Q2 customers does not return some of the customers the business expects to see, in-depth data analysis follows. The data analysis may determine that some customer records are duplicated and some transaction records have incorrect or missing transaction dates. This type of finding can trigger activities aimed at understanding the data issues and taking corrective actions.

       

      An advantage of this approach to data quality is that it is aligned with the tactical needs of business functions, groups and departments. A disadvantage is that it addresses data quality issues reactively, based upon a business request or even a complaint. Some data quality issues may not be easy to discover, and business users cannot always decide which report is right and which one is wrong. The organization may eventually conclude that its data is bad but be unable to indicate what exactly needs to be fixed, which limits IT’s ability to fix the issues. When multiple lines of business (LOBs) and functions across the enterprise struggle with their specific data quality issues separately, it is difficult to quantify the overall state of data quality and define the priorities with which data quality problems are to be addressed by the enterprise.

       

      The second approach is based on data profiling. Data profiling tools are intended to make the data quality improvement process more proactive and measurable. A number of data profiling metrics are typically introduced to screen for missing and invalid attributes, duplicate records, duplicate attribute values that are supposed to be unique, frequency of attributes, cardinality of attributes and their allowed values, standardization and validation of certain data formats for simple and complex attribute types, violations of referential integrity, etc. A limitation of data profiling techniques is that additional analysis is required to understand which of the metrics are most important for the business and why. It may not be easy to come up with a definitive answer and translate it into a data quality improvement action plan. The variety of data profiling metrics is not based on science but rather driven by the variety of ways relational database technology can report on data quality issues.
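As a rough illustration of the kind of profiling metrics listed above, the following sketch counts missing values, attribute cardinality, repeated values and exact duplicate records. The rows, attribute names and helper functions are hypothetical; real profiling tools report many more metrics than this.

```python
from collections import Counter

# Hypothetical customer rows; None marks a missing attribute value.
rows = [
    {"id": 1, "name": "Larry", "state": "NJ"},
    {"id": 2, "name": "Jim", "state": None},
    {"id": 3, "name": "Scott", "state": "CA"},
    {"id": 3, "name": "Scott", "state": "CA"},  # duplicate record
]


def profile(rows, attribute):
    """Return simple profiling metrics for one attribute."""
    values = [row[attribute] for row in rows]
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "cardinality": len(counts),
        "duplicated_values": sorted(v for v, n in counts.items() if n > 1),
    }


def duplicate_records(rows):
    """Count rows that repeat an earlier row exactly."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in counts.values())


state_profile = profile(rows, "state")
id_profile = profile(rows, "id")
n_duplicates = duplicate_records(rows)
```

The sketch also shows the limitation described above: the numbers are easy to produce, but nothing in them says which metric matters most to the business.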

       

      Each of the two approaches above has its niche and significance. When the quality of master data is in question an alternative and more strategic approach can be considered by data governance organizations. This approach avoids detailed analysis of business applications while providing a solid scientific foundation for its metrics.

      Information Theory Approach to Data Quality for MDM  

      Master data are those data that are foundational to business processes, are usually widely distributed, contribute directly to the success of an organization when well managed, and pose the most risk when not well managed. Customer, Patient, Citizen, Member, Client, Broker, Product, Financial Instrument and Drug are entities often referred to as master data entities, while the company-specific selection of master entities is driven by the enterprise’s business and focus.

       

      Master Data Service (MDS) defines its primary function as the creation of the “golden view” of the master entities. We will assume that MDS has successfully created and maintains the “golden view” of entity F in the data hub. This “golden record” can be dynamic or persistent. A number of data sources across the enterprise hold data corresponding to domain F. These include the source systems that feed the data hub and other data sources that may not be integrated with the data hub. We will define an external data set f whose data quality is to be quantified with respect to F. For the purpose of this discussion f can represent any data set, such as a single data source or multiple sources.

       

      Our goal is to compare the source data set f with the entity data set F. The data quality of the data set f will be characterized by how well it represents the benchmark entity F defined as the “golden view” for the data in domain F. We are making an assumption here that the “golden view” was created algorithmically and then validated by the data stewards.

       

      In Information Theory the information quantity associated with the entity F is expressed in terms of the entropy:

                                                    

          H(F) = – ∑ P_k log P_k,    (1)

      where P_k are the probabilities of the attribute (token) values in the “golden” data set F. The index k runs over the attribute values found across all records and all attributes in F. The logarithm is taken to base 2, so H is measured in bits.

       

      H(F) represents the quantity of information in the “golden” representation of entity F.
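A minimal sketch of equation (1) for a single attribute, assuming the probabilities are estimated as relative frequencies of the values in that attribute's column. The function names and the sample column are illustrative only.

```python
import math
from collections import Counter


def value_probabilities(values):
    """Relative frequency of each distinct value in one attribute column."""
    counts = Counter(values)
    total = len(values)
    return {value: n / total for value, n in counts.items()}


def entropy(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)), skipping zero terms."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical attribute column with one repeated value.
states = ["NJ", "GA", "CA", "CA"]
h_states = entropy(value_probabilities(states).values())
```

Here the three distinct values have probabilities 0.25, 0.25 and 0.5, so the column contributes 1.5 bits. Summing such contributions over every attribute gives H(F).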

       

      Similarly for the comparison data set f

       

          H(f) = – ∑ p_i log p_i,    (2)

       

      We will use small “p” for the probabilities associated with f while capital letter “P” is used for the probabilities characterizing the “golden” entity record.

       

      Mutual entropy J(f,F), known in Information Theory as mutual information, characterizes how well f represents F.

                         

      J(f,F) = H(f) + H(F) – H(f,F)                                                                        (3)   

       

      In (3) H(f,F) is the joint entropy of f and F. It is expressed in terms of probabilities of combined events, e.g. the probability that the name = “Smith” in the “golden record” F and name = “Schmidt” in the source record linked to the same entity. The behavior of J makes this function a good candidate for quantifying the data quality of f with respect to F. When the data quality is low, the correlation between f and F is low. In the extreme case of very low data quality, f does not correlate with F and the two variables are independent. Then

       

                          H(f,F) = H(f) + H(F)                                                                                      (4)   

       

      and

       

                          J(f,F) = 0                                                                                                       (5)   

       

      If f represents F extremely well, e.g. f = F, then H(f) = H(F) = H(f,F) and

       

                          J(f,F) = H(F)                                                                                                  (6)   

       

      We define Data Quality of f with respect to F by the following equation:

       

                          DQ(f,F) = J(f,F)/H(F)                                                                                      (7)   

       

      With this definition, DQ ranges from 0 to 1. When DQ = 0, the data quality of f is minimal and f does not represent F at all. When DQ = 1, f represents F perfectly and the data quality of f with respect to F is 100%.
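Equations (3) and (7) translate directly into code once the three entropies are known. The sketch below is only an illustration; the function names are hypothetical and the entropy values in the limiting cases are made-up numbers chosen to satisfy equations (4) and (6).

```python
def mutual_entropy(h_f, h_F, h_joint):
    """Equation (3): J(f,F) = H(f) + H(F) - H(f,F)."""
    return h_f + h_F - h_joint


def data_quality(h_f, h_F, h_joint):
    """Equation (7): DQ(f,F) = J(f,F) / H(F); requires H(F) > 0."""
    if h_F <= 0:
        raise ValueError("H(F) must be positive")
    return mutual_entropy(h_f, h_F, h_joint) / h_F


# Limiting cases, using a hypothetical H(F) of 3.5 bits.
dq_identical = data_quality(3.5, 3.5, 3.5)    # f = F, equation (6)
dq_independent = data_quality(2.0, 3.5, 5.5)  # H(f,F) = H(f) + H(F), equation (4)
```

The two limiting cases return 1.0 and 0.0, matching the ends of the DQ scale.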

       

      The approach can also be used to determine partial data quality at the attribute/token level. This will provide additional insight into what causes the most significant data quality issues.

       

      Data quality improvement should be done iteratively. Changes in the source data may impact the “golden record”. Equations (1) and (7) are then applied again to recalculate the data quantity and data quality characteristics.

       

      Conclusion

      The article offers an Information Theory based method for quantifying Information Assets and the Data Quality of the Assets through equations (1) and (7). The proposed method leverages the notion of a “golden record” created and maintained in the data hub. The “golden record” is used as the benchmark against which the data quality of other sources is measured.

       

      Organizations can leverage this approach to augment their data governance offerings for MDM and differentiate their data governance approach. The quantitative approach to data quality will ultimately help data governance organizations develop policies based on scientifically defined data quality and quantity metrics.

       

      By applying this approach consistently on a number of engagements, over time we will accumulate valuable insights into how metrics (1) and (7) apply to real-world data characteristics and scenarios. We will develop good practices defining acceptable data quality thresholds. For example, a future industry policy for the P&C insurance business might be to keep the quality of Customer data above the 92% mark, which would be a clearly articulated data governance policy based on a scientifically sound approach to data quality metrics.

       

      The developed approach can be incorporated in future products to enable data governance and provide data governance organizations with new tooling. Data governance will be able to select information sources and assets to be measured, quantify them according to (1) and (7), set the target metrics for data stewards, measure progress on an ongoing basis and report on data quality improvement.

       

      Even though we are mainly focusing on data quality, the quantity of data in equation (1) characterizes the overall significance of a corporate data set from the Information Theory perspective. For M&A, the method can be used to measure the additional amount of information that the joint enterprise will have compared to the information owned by the companies separately. The approach developed above will measure both the information acquired due to the difference in the customer bases and the information quantity increment due to better, more precise and more useful information about the existing customers.

       

      Appendix: Simple Illustrative Examples

      In this Appendix we will apply the theory developed above to two simple illustrative cases. We will define the “golden” data set F as follows:

       

      EID   Name    State
      1     Larry   NJ
      2     Jim     GA
      3     Scott   CA
      4     Marty   CA

       

      The probabilities of attributes values in F are:

       

      Value   Probability (P)   log P   P log P
      Larry   0.25              -2      -0.5
      Jim     0.25              -2      -0.5
      Scott   0.25              -2      -0.5
      Marty   0.25              -2      -0.5
      NJ      0.25              -2      -0.5
      GA      0.25              -2      -0.5
      CA      0.5               -1      -0.5

      Scenario 1

      Dataset f is the same as the “golden” data set. Then

       

                                                      f = F, H(f) = H(F) = 3.5.

       

      The probability matrix for combined values:

       

      Value          Probability (P)   log P   P log P
      Larry, Larry   0.25              -2      -0.5
      Jim, Jim       0.25              -2      -0.5
      Scott, Scott   0.25              -2      -0.5
      Marty, Marty   0.25              -2      -0.5
      NJ, NJ         0.25              -2      -0.5
      GA, GA         0.25              -2      -0.5
      CA, CA         0.5               -1      -0.5

      H(f,F) = –∑ P_k log P_k = 3.5

       

      and

       

                                                               H(F) = H(f) = H(f,F) = 3.5

       

       

                                                   J(f,F) = H(F) + H(f) – H(f,F) = H(F) = 3.5

       

      Equation (7) yields

       

                                                                 DQ = J(f,F)/H(F) = 1

       

      As expected, when f = F the data quality of f is 1, or 100%.

       

      Scenario 2

      Dataset for the “golden record” F remains the same as in scenario 1.

       

                                                                    H(F) = 3.5

       

      We will change dataset f by adding a new record: “Larry, CA”. We will assume that the new record for “Larry” represents the same individual as “Larry, NJ”. Therefore records “Larry, NJ” and “Larry, CA” will have the same EID = 1. Data stewards determined that “NJ” is the right value for the attribute State. Dataset f is as follows:

       

                                                     

      EID   Name    State
      1     Larry   NJ
      2     Jim     GA
      3     Scott   CA
      4     Marty   CA
      1     Larry   CA

       

      The probability matrix for f is:

      Value   Probability (P)   log P      P log P
      Larry   0.4               -1.32193   -0.528771238
      Jim     0.2               -2.32193   -0.464385619
      Scott   0.2               -2.32193   -0.464385619
      Marty   0.2               -2.32193   -0.464385619
      NJ      0.2               -2.32193   -0.464385619
      GA      0.2               -2.32193   -0.464385619
      CA      0.6               -0.73697   -0.442179356

      H(f) = 3.292878689

       

       

       

      The probability matrix for combined values:

      Value          Probability (P)   log P      P log P
      Larry, Larry   0.4               -1.32193   -0.528771238
      Jim, Jim       0.2               -2.32193   -0.464385619
      Scott, Scott   0.2               -2.32193   -0.464385619
      Marty, Marty   0.2               -2.32193   -0.464385619
      NJ, NJ         0.2               -2.32193   -0.464385619
      GA, GA         0.2               -2.32193   -0.464385619
      CA, CA         0.4               -1.32193   -0.528771238
      NJ, CA         0.2               -2.32193   -0.464385619

      H(f,F) = 3.84385619

       

      Substituting the values for H(F), H(f) and H(f,F) into equations (3) and (7), we obtain:

       

                                       J(f,F) = 3.5 + 3.292878689 – 3.84385619 = 2.949022499

       

                                        DQ = J(f,F)/H(F) = 2.949022499/3.5 = 0.842577857 or ~ 84%
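The Scenario 2 arithmetic can be checked with a short script. This is a sketch under the same conventions as the tables above: probabilities are relative frequencies computed per attribute, entropies are summed over the Name and State attributes, and source records are linked to the “golden” records by EID. The variable and function names are illustrative, and the attribute-level score at the end normalizes by each attribute's own share of H(F), which is an assumption rather than something the article specifies.

```python
import math
from collections import Counter

ATTRIBUTES = ("name", "state")

# "Golden" records F keyed by EID, and source records f (from Scenario 2).
golden = {1: {"name": "Larry", "state": "NJ"}, 2: {"name": "Jim", "state": "GA"},
          3: {"name": "Scott", "state": "CA"}, 4: {"name": "Marty", "state": "CA"}}
source = [(1, {"name": "Larry", "state": "NJ"}), (2, {"name": "Jim", "state": "GA"}),
          (3, {"name": "Scott", "state": "CA"}), (4, {"name": "Marty", "state": "CA"}),
          (1, {"name": "Larry", "state": "CA"})]


def entropy(items):
    """Entropy in bits of the relative frequencies of the given items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def attribute_entropies(attr):
    """Return H(F), H(f), H(f,F) for one attribute, linking f to F by EID."""
    h_F = entropy(rec[attr] for rec in golden.values())
    h_f = entropy(rec[attr] for _, rec in source)
    h_joint = entropy((golden[eid][attr], rec[attr]) for eid, rec in source)
    return h_F, h_f, h_joint


per_attr = {attr: attribute_entropies(attr) for attr in ATTRIBUTES}
h_F, h_f, h_joint = (sum(vals[i] for vals in per_attr.values()) for i in range(3))
j = h_f + h_F - h_joint   # equation (3)
dq = j / h_F              # equation (7)

# Attribute-level quality (an assumed normalization by each attribute's own H(F)).
dq_by_attr = {a: (hf + hF - hj) / hF for a, (hF, hf, hj) in per_attr.items()}
```

Under these assumptions the script reproduces H(F) = 3.5, H(f) ≈ 3.2929, H(f,F) ≈ 3.8439 and DQ ≈ 0.84. The attribute-level scores come out at 1.0 for Name and about 0.63 for State, which points at the conflicting State value as the source of the quality loss.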

      Category: Data Quality, Enterprise Data Management, Information Development, Information Governance, Information Management, Information Value, Master Data Management
      5 Comments »

      by: Robert.hillard
      17  Nov  2007

      Facebook as a CDI

      It’s been a lucrative five years for consultants in information management, with new work being as easy to win as saying the word “compliance”.  Executives are more than willing to sign up for new consulting engagements based on the need to meet their compliance and regulatory requirements.  The trouble is, this type of information management engagement breeds a defensive rather than a confident enterprise.

      A defensive organization believes that data needs to be locked-down, that risks need to be taken out and the analysis resulting from any dataset should be predictable.  Of course, any regular reader of this blog would know that we view data contained in large enterprises as complex and displaying all of the attributes of chaos mathematics which means any attempt to remove surprises from data is a fruitless endeavor.

      A confident organization, on the other hand, recognizes that data is complex and chaotic but seeks to gain benefit from that complexity.  Rather than be afraid of randomness, they use the techniques of MIKE2.0 to identify the risks and then focus on monitoring and measuring.  In general, I observe a strong correlation between the confident enterprise and the adoption of Web 2.0 techniques and principles.  The confident organization believes that there is more value in collaboration and is willing to sponsor individual innovation.

      A good example of why this is so important can be seen in social networking sites such as Facebook.  With the rapid growth in their use by a new generation of consumers, service providers ranging from telecommunications and financial services right through to government need to come to grips with both the technology and the cultural drivers behind them.  Consumers are becoming more confident in sharing quite detailed information about themselves in a way that they expect others to pick up.  Increasingly it will make no sense for providers to ask individuals to provide data about their relationships, locale or other details when those are already available on the public web.

      In fact, one of the reasons why Facebook is so powerful is its ability to interface into custom applications.  Imagine the impact if you wanted to sell these consumers a new financial or telecommunications product and you made it possible to apply online from within Facebook!  More importantly, you can give the individual a sense of control by allowing them to privately share critical information with you and then maintain it in a form with which they are comfortable – perhaps for a multitude of providers.

      Obviously there are challenges in this type of initiative, but good use of data measurement, reconciliation and parsing approaches allow it to be done.  The question is whether your enterprise has even considered whether it’s worth doing?  You can bet it won’t be long before your competitors do!

      Category: Enterprise2.0, MIKE2.0, Master Data Management, Web2.0
      8 Comments »

      by: Sean.mcclowry
      21  Oct  2007

      Globalization and Name Recognition

      Organizations that focus on individual consumers often struggle to identify their customers at the most basic level – their name. There are many reasons for this:

      • Capture-dependent: spelling mistakes
      • Customer-dependent: name changes
      • Application-dependent: packing multiple fields into a single field
      • Architecture-dependent: conflicting names for the same person across systems

      These different types of issues then become increasingly difficult to address in a complex organization such as a retail bank or telco where dozens to hundreds of systems may hold customer records.

      Collectively, Customer Data Integration (CDI) means doing all these things well, and it helps address what was a cause of failure on many Customer Relationship Management (CRM) implementations. Vendors such as IBM, Siperian, Initiate and Oracle offer CDI-specific solutions and the market is undergoing rapid growth.

      Over the last few years there have been significant benefits to addressing these issues through better governance, data quality improvement programmes and upgrades to new applications that were more sophisticated in their capability to store customer data.

      This involves fixing historical issues and minimizing the chance of errors occurring in the future.

      One of the challenges that globalization brings is around name recognition. Techniques that have been applied over the past few years simply do not work as well with many Eastern European, North African, Middle Eastern and Asian names. The phonetic translations that convert Arabic names into a Western form are typically inconsistent.
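The effect of inconsistent transliteration on matching can be sketched with Python's standard difflib module. This is a generic string-similarity measure used here only for illustration; it is not the technique used by any of the vendors or products mentioned in this post, and the name variants below are hypothetical examples.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical transliteration variants of one name, plus an unrelated name.
variants = ["Mohammed", "Muhammad", "Mohamed", "Mohamad"]
unrelated = "Michael"

# Pairwise scores between the first spelling and every other candidate.
scores = {other: similarity(variants[0], other) for other in variants[1:] + [unrelated]}

# An exact-match rule would treat every variant as a different person.
exact_matches = [v for v in variants[1:] if v == variants[0]]
```

Exact matching finds none of the variants, while a similarity score needs a threshold, and choosing one that accepts the variants without also accepting unrelated names is where the difficulty lies.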

      Living in London, I see the Retail Banking sector facing perhaps the greatest complexity worldwide. Rapidly changing demographics require new techniques and technologies to solve this name recognition issue. Once again, big vendors are moving into this space through acquisition – with IBM offering a specific product – GNR – to meet the globalization name challenge.

      Category: Master Data Management
      No Comments »
