Posts Tagged ‘data standards’
In previous posts on this blog, I have discussed aspects of data quality using travel reviews, the baseball strike zone, Christmas songs, the wind chill factor, and the bell curve.
Grab a big popcorn and a giant soda since this post discusses data quality using movie ratings.
The Following Post has been Approved for All Audiences
In the United States prior to 1922, every state and many cities had censorship boards that prevented movies from being shown in local theaters on the basis of “immoral” content. What exactly was immoral varied by censorship board.
In 1922, the Motion Picture Association of America (MPAA), representing all the major American movie studios, was formed in part to encourage the movie industry to censor itself. Will Hays, the first MPAA president, helped develop what came to be called the Hays Code, which used a list of criteria to rate movies as either moral or immoral.
For three decades, if a movie failed the Hays Code most movie theaters around the country would not show it. After World War II ended, however, views on movie morality began to change. Frank Sinatra received an Oscar nomination for his role as a heroin addict in the 1955 drama The Man with the Golden Arm. Jack Lemmon received an Oscar nomination for his role as a cross-dressing musician in the 1959 comedy Some Like It Hot. Both movies failed the Hays Code, but were booked by movie theaters based on good reviews and became big box office hits.
Then in 1968, a landmark Supreme Court decision (Ginsberg v. New York) ruled that states could “adjust the definition of obscenity as applied to minors.” Fearing the revival of local censorship boards, the MPAA created a new rating system intended to help parents protect their children from obscene material. Even though the ratings carried no legal authority, parents were encouraged to use them as a guide in deciding what movies their children should see.
While a few changes have occurred over the years (most notably adding PG-13 in 1984), these are the same movie ratings we know today: G (General Audiences, All Ages Admitted), PG (Parental Guidance Suggested, Some Material may not be Suitable for Children), PG-13 (Parents Strongly Cautioned, Some Material may be Inappropriate for Children under 13), R (Restricted, Under 17 requires Accompanying Parent or Adult Guardian), and NC-17 (Adults Only, No One 17 and Under Allowed). For more on these ratings and how they are assigned, read this article by Dave Roos.
What Movie Ratings teach us about Data Quality
Just as the MPAA learned from the failure of the Hays Code that movies cannot simply be rated as moral or immoral, data quality cannot simply be rated as good or bad. Perspectives about the quality standards for data, like the moral standards for movies, change over time. For example, consider how big data challenges traditional data quality standards.
Furthermore, good and bad, like moral and immoral, are ambiguous. A specific context is required to help interpret any rating. For the MPAA, the specific context became rating movies based on obscenity from the perspective of parents. For data quality, the specific context is based on fitness for the purpose of use from the perspective of business users.
Adding context, however, does not guarantee everyone will agree with the rating. Debates rage over the rating given to a particular movie. Likewise, many within your organization will disagree with the quality assessment of certain data. So the next time your organization calls a meeting to discuss its data quality standards, you might want to grab a big popcorn and a giant soda since the discussion could be as long and dramatic as a Peter Jackson trilogy.
While security and privacy issues prevent sensitive data from being shared (e.g., customer data containing personal financial information or patient data containing personal health information), do you have access to data that would be more valuable if you shared it with the rest of your organization—or perhaps the rest of the world?
We are all familiar with the opposite of data sharing within an organization: data silos. Somewhat ironically, many data silos start with data that was designed to be shared with the entire organization (e.g., from an enterprise data warehouse), but was then replicated and customized in order to satisfy the particular needs of a tactical project or strategic initiative. This customized data often becomes obsolete after the conclusion (or abandonment) of its project or initiative.
Data silos are usually denounced as evil, but the real question is whether the data hoarded within a silo is sharable. Is it something usable by the rest of the organization, which may be redundantly storing and maintaining its own private copies of the same data, or are the contents of the data silo something only one business unit uses (or is allowed to access in the case of sensitive data)?
Most people decry data silos as the bane of successful enterprise data management—until you expand the scope of data beyond the walls of the organization, where the enterprise’s single version of the truth becomes a cherished data asset (i.e., an organizational super silo) intentionally siloed from the versions of the truth maintained within other organizations, especially competitors.
We need to stop needlessly replicating and customizing data—and start reusing and sharing data.
Historically, replication and customization had two primary causes:
- Limitations in technology (storage, access speed, processing speed, and a truly sharable infrastructure like the Internet) meant that the only option was to create and maintain an internal copy of all data.
- Proprietary formats and customized (and also proprietary) versions of common data were viewed as a competitive differentiator, even before the recent dawn of the realization that data is a corporate asset.
Hoarding data in a proprietary format and viewing “our private knowledge is our power” must be replaced with shared data in an open format and viewing “our shared knowledge empowers us all.”
This is an easier mantra to recite than it is to realize, not only within an organization or industry, but even more so across organizations and industries. However, one of the major paradigm shifts of 21st century data management is making more data publicly available, following open standards (such as MIKE2.0) and using unambiguous definitions so data can be easily understood and reused.
Of course, data privacy still requires sensitive data not be shared without consent, and competitive differentiation still requires intellectual property not be shared outside the organization. But this still leaves a vast amount of data that, if shared, could benefit our increasingly hyper-connected world where most of the boundaries that used to separate us are becoming more virtual every day. Some examples of this were given in the recent blog post shared by Henrik Liliendahl Sørensen about Winning by Sharing Data.
A few weeks ago, while reading about the winners at the 56th Annual Grammy Awards, I saw that Daft Punk won both Record of the Year and Album of the Year, which made me wonder what the difference is between a record and an album.
Then I read that Record of the Year is awarded to the performer and the production team of a single song. While Daft Punk won Record of the Year for their song “Get Lucky”, the song was not lucky enough to win Song of the Year (that award went to Lorde for her song “Royals”).
My confusion about the semantics of the Grammy Awards prompted a quick trip to Wikipedia, where I learned that Record of the Year is awarded for either a single or individual track from an album. This award goes to the performer and the production team for that one song. In this context, record means a particular recorded song, not its composition or an album of songs.
Although Song of the Year is also awarded for a single or individual track from an album, the recipient of this award is the songwriter who wrote the lyrics to the song. In this context, song means the song as composed, not its recording.
The Least Ambiguous Award goes to Album of the Year, which is indeed awarded for a whole album. This award goes to the performer and the production team for that album. In this context, album means a recorded collection of songs, not the individual songs or their compositions.
These distinctions, and the confusion they caused me, seemed eerily reminiscent of the challenges that happen within organizations when data is ambiguously defined. For example, terms like customer and revenue are frequently used without definition or context. Ambiguous data definitions can easily lead to incorrect uses of data as well as confusing references to data during business discussions.
Not only is it difficult to reach consensus on data definitions, but definitions also change over time. For example, Record of the Year used to be awarded to only the performer, not the production team. And the definition of who exactly counts as a member of the production team has been changed four times over the years, most recently in 2013.
Avoiding semantic inconsistencies, such as the difference between a baker and a Baker, is an important aspect of metadata management. Be diligent with your data definitions and avoid daft definitions for sound semantics.
The weakness of using relative rankings instead of absolute standards was exemplified in the TED Book by Jim Hornthal, A Haystack Full of Needles, using the humorous story of how Little Rock, Arkansas was initially rated as one of the world’s premier destinations for museums, romance, and fine dining by Triporati.com, a travel site intended to offer personalized destination recommendations powered by expert travel advice.
During alpha testing of their recommendation engine, Triporati discovered that the expert assigned to rate Little Rock had awarded the city’s best museums and restaurants with the highest score (5) because they were “the best that Little Rock has.”
No offense intended to Little Rock, Arkansas, but this is why you need data quality standards.
As Hornthal explained, “relative ranking alone is deadly.” Triporati’s experts had to be re-trained to understand that the goal wasn’t to determine “What is the best that a city has to offer?” but rather, “How do these attributes compare to the best?”
To put the ratings into context for the experts, Triporati explained that the ratings used by the recommendation engine for attractions (e.g., museums, restaurants) should represent the following:
- 5 — People would come from all over the world to visit the attraction
- 4 — People would come from all over the country to visit the attraction
- 3 — People would come from all across a state or region to visit the attraction
- 2 — People could be enticed to cross a city to visit the attraction
- 1 — People might be enticed to cross the street to visit the attraction
In the business scenario of the Triporati recommendation engine, improving data quality required the establishment of an absolute standard. Most data quality professionals would say that this is an example of defining data quality as real-world alignment.
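As a rough illustration of what an absolute standard looks like once it is written down, the five-point scale above can be encoded so that every score maps to the same fixed meaning regardless of which city is being rated. This is only a sketch: the names `DrawRadius`, `RUBRIC`, and `describe_rating` are hypothetical and are not taken from Triporati's actual system.

```python
from enum import IntEnum


class DrawRadius(IntEnum):
    """Hypothetical encoding of the five-point scale described above."""
    STREET = 1
    CITY = 2
    REGION = 3
    COUNTRY = 4
    WORLD = 5


# Each score has one fixed meaning, independent of the city being rated.
RUBRIC = {
    DrawRadius.WORLD: "People would come from all over the world",
    DrawRadius.COUNTRY: "People would come from all over the country",
    DrawRadius.REGION: "People would come from all across a state or region",
    DrawRadius.CITY: "People could be enticed to cross a city",
    DrawRadius.STREET: "People might be enticed to cross the street",
}


def describe_rating(score: int) -> str:
    """Return the rubric text for a score; raise ValueError if outside 1-5."""
    return RUBRIC[DrawRadius(score)]
```

An expert who wants to award a 5 to a local museum now has to agree with the statement that people would come from all over the world to see it, which is a much harder claim to make than "the best this city has."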
An alternative definition of data quality is fitness for the purpose of use, which is more akin to a relative ranking. A common challenge is that data often has multiple business uses, each with its own fitness requirements, which is why applying data investigation to different business problems reveals different data quality tolerances that an absolute standard cannot reflect.
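To make the contrast concrete, here is a minimal sketch of fitness for the purpose of use: one measurement of the same data, judged against a different tolerance for each business use. The use names, the thresholds, and the functions `completeness` and `fit_for_use` are assumptions made up for this example.

```python
# Hypothetical minimum completeness required by each business use.
TOLERANCES = {
    "marketing_campaign": 0.80,
    "regulatory_report": 0.99,
}


def completeness(records, field):
    """Fraction of records in which `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def fit_for_use(records, field):
    """Judge the same completeness score against each use's tolerance."""
    score = completeness(records, field)
    return {use: score >= minimum for use, minimum in TOLERANCES.items()}


# Nine of ten customer records carry a postal code.
customers = [{"postal_code": "72201"}] * 9 + [{"postal_code": ""}]
result = fit_for_use(customers, "postal_code")
```

With these assumed thresholds, the same 90 percent complete data is fit for the marketing campaign but not for the regulatory report, which is the kind of difference a single absolute standard would hide.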
Furthermore, since some fine folks in the Southern United States might argue that Little Rock, Arkansas being a premier destination for museums and fine dining is more aligned with their real world, and since fine dining that is fit for the purpose of my use would be any restaurant that serves bacon double cheeseburgers and local micro-brewed beers, there is simply no accounting for taste.
But there needs to be an accounting for data quality, which is why you need data quality standards.
As part of MIKE2.0, we believe we are presenting a unique perspective in the area of standards development. Our approach is to create a collaborative community for the development of standards for Information Management, including those that apply to Capital Markets.
Some interesting work around open source and open standards is developing in relation to market data:
- Market Data Definition Language (MDDL) is an Extensible Markup Language (XML) derived specification, which facilitates the interchange of information about financial instruments used throughout the world’s markets. A community has been built around MDDL, including a wiki-based development environment.
With open content and collaborative technologies, it’s easy for these projects to work together, and we’ve started doing this through MIKE2.0 with references to these projects.
Organizations typically suffer from a lack of standards around information management. They develop standards on their own although they may use external reference materials. The issue is that most of the standards are definitional, but not validation-based. That is, the standards may say how a data warehouse model should be developed or provide a policy about how reference data should be synchronized.
What is missing is the validation step against these standards. What would be valuable are validation tools that test areas such as complexity while the solution is being developed. Having tools as simple as the W3C Markup Validation Service would be a big help to the industry.
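To show the difference between a definitional standard and a validation-based one, the sketch below expresses a tiny standard as executable rules and reports every violation, much as a markup validator lists errors. It checks records rather than solution complexity, and the field names, the rules, and the `validate` function are all hypothetical rather than part of MIKE2.0 or any existing tool.

```python
# Hypothetical standard: each field name maps to a rule its value must pass.
STANDARD = {
    "customer_id": lambda v: isinstance(v, str) and v != "",
    "country_code": lambda v: isinstance(v, str) and len(v) == 2 and v.isupper(),
}


def validate(records):
    """Return a list of (record index, field, problem) for every violation."""
    violations = []
    for index, record in enumerate(records):
        for field, rule in STANDARD.items():
            if field not in record:
                violations.append((index, field, "missing"))
            elif not rule(record[field]):
                violations.append((index, field, "invalid"))
    return violations


sample = [
    {"customer_id": "C1", "country_code": "US"},   # conforms
    {"customer_id": "", "country_code": "usa"},    # both fields invalid
    {"customer_id": "C3"},                         # country_code missing
]
report = validate(sample)
```

A check like this can run while a solution is still being developed, so the standard is tested continuously instead of only being written down.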