Posts Tagged ‘data quality’
In a previous post, I discussed some data quality and data governance issues associated with open data. In his recent blog post How far can we trust open data?, Owen Boswarva raised several good points about open data.
“The trustworthiness of open data,” Boswarva explained, “depends on the particulars of the individual dataset and publisher. Some open data is robust, and some is rubbish. That doesn’t mean there’s anything wrong with open data as a concept. The same broad statement can be made about data that is available only on commercial terms. But there is a risk attached to open data that does not usually attach to commercial data.”
Data quality, third-party rights, and personal data were three grey areas Boswarva discussed. Although his post focused on a specific open dataset published by an agency of the government of the United Kingdom (UK), his points are generally applicable to all open data.
As Boswarva remarked, the quality of a lot of open data is high even though there is no motivation to incur the financial cost of verifying the quality of data being given away for free. The “publish early even if imperfect” principle also encourages a laxer data quality standard for open data. However, “the silver lining for quality-assurance of open data,” Boswarva explained, is that “open licenses maximize re-use, which means more users and re-users, which increases the likelihood that errors will be detected and reported back to the publisher.”
The issue of third-party rights raised by Boswarva was one that I had never considered. His example was the use of a paid third-party provider to validate and enrich postal address data before it is released as part of an open dataset. Therefore, consumers of the open dataset benefit from postal validation and enrichment without paying for it. While the UK third-party providers in this example acquiesced to open re-use of their derived data because their rights were made clear to re-users (i.e., open data consumers), Boswarva pointed out that re-users should be aware that using open data doesn’t provide any protection from third-party liability and, more importantly, doesn’t create any obligation on open data publishers to make sure re-users are aware of any such potential liability. While, again, this is a UK example, that caution should be considered applicable to all open data in all countries.
As for personal data, Boswarva noted that while open datasets are almost invariably non-personal data, “publishers may not realize that their datasets contain personal data, or that analysis of a public release can expose information about individuals.” The example in his post centered on the postal addresses of property owners, which without the names of the owners included in the dataset, are not technically personal data. However, it is easy to cross-reference this with other open datasets to assemble a lot of personally identifiable information that if it were contained in one dataset would be considered a data protection violation (at least in the UK).
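The re-identification risk Boswarva describes can be illustrated with a toy example. The sketch below uses entirely synthetic data and hypothetical field names; it shows how joining two individually innocuous open datasets on a shared key (here, a postal address) can assemble personally identifiable information that neither dataset contains on its own.

```python
# Two hypothetical open datasets, each harmless in isolation.
property_owners = [  # addresses only -- no names, so "non-personal" data
    {"address": "12 High St", "property_type": "detached"},
]
electoral_roll = [  # a separate public register linking names to addresses
    {"name": "A. Smith", "address": "12 High St"},
]

# Cross-referencing on the shared address key re-identifies the owner.
by_address = {row["address"]: row for row in electoral_roll}
combined = [
    {**prop, "owner": by_address[prop["address"]]["name"]}
    for prop in property_owners
    if prop["address"] in by_address
]
print(combined)  # the merged record now pairs a named person with a property
```

Neither publisher released personal data, yet the join produces it, which is exactly why analysis of a public release can expose information about individuals.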
Calls for increased transparency and accountability have led government agencies around the world to make more information available to the public as open data. As more people accessed this information, it quickly became apparent that data quality and data governance issues complicate putting open data to use.
“It’s an open secret,” Joel Gurin wrote, “that a lot of government data is incomplete, inaccurate, or almost unusable. Some agencies, for instance, have pervasive problems in the geographic data they collect: if you try to map the factories the EPA regulates, you’ll see several pop up in China, the Pacific Ocean, or the middle of Boston Harbor.”
A common reason for such data quality issues in the United States government’s data is what David Weinberger wrote about Data.gov. “The keepers of the site did not commit themselves to carefully checking all the data before it went live. Nor did they require agencies to come up with well-formulated standards for expressing that data. Instead, it was all just shoveled into the site. Had the site keepers insisted on curating the data, deleting that which was unreliable or judged to be of little value, Data.gov would have become one of those projects that each administration kicks further down the road and never gets done.”
Of course, the United States is not alone in either making government data open (about 60 countries have joined the Open Government Partnership) or having it reveal data quality issues. Victoria Lemieux recently blogged about data issues hindering the United Kingdom government’s Open Data program in her post Why we’re failing to get the most out of open data.
One of the data governance issues Lemieux highlighted was data provenance. “Knowing where data originates and by what means it has been disclosed,” Lemieux explained, “is key to being able to trust data. If end users do not trust data, they are unlikely to believe they can rely upon the information for accountability purposes.” Lemieux explained that determining data provenance can be difficult since “it entails a good deal of effort undertaking such activities as enriching data with metadata, such as the date of creation, the creator of the data, who has had access to the data over time. Full comprehension of data relies on the ability to trace its origins. Without knowledge of data provenance, it can be difficult to interpret the meaning of terms, acronyms, and measures that data creators may have taken for granted, but are much more difficult to decipher over time.”
I think the bad press about open data is a good thing because open data is opening eyes to two basic facts about all data. One, whenever data is made available for review, you will discover data quality issues. Two, whenever data quality issues are discovered, you will need data governance to resolve them. Therefore, the reason we’re failing to get the most out of open data is the same reason we fail to get the most out of any data.
The second of the five biggest data myths debunked by Gartner is that many IT leaders believe the huge volume of data that organizations now manage makes individual data quality flaws insignificant due to the law of large numbers.
Their view is that individual data quality flaws don’t influence the overall outcome when the data is analyzed because each flaw is only a tiny part of the mass of big data. “In reality,” as Gartner’s Ted Friedman explained, “although each individual flaw has a much smaller impact on the whole dataset than it did when there was less data, there are more flaws than before because there is more data. Therefore, the overall impact of poor-quality data on the whole dataset remains the same. In addition, much of the data that organizations use in a big data context comes from outside, or is of unknown structure and origin. This means that the likelihood of data quality issues is even higher than before. So data quality is actually more important in the world of big data.”
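Friedman’s point is simple proportional arithmetic. Here is a minimal sketch, assuming a hypothetical 1% flaw rate: scaling from a thousand records to a million makes each individual flaw a smaller fraction of the whole, but the flawed share of the dataset stays exactly the same.

```python
error_rate = 0.01  # assume 1% of records are flawed, regardless of scale

for total_records in (1_000, 1_000_000):
    flawed = int(total_records * error_rate)
    share_per_flaw = 1 / total_records  # each flaw matters less individually...
    # ...but the overall impact on the whole dataset is unchanged.
    print(total_records, flawed, flawed / total_records)
```

At a thousand records there are 10 flaws; at a million there are 10,000. The impact per flaw shrinks a thousandfold, yet the overall proportion of poor-quality data never moves.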
“Convergence of social, mobile, cloud, and big data,” Gartner’s Svetlana Sicular blogged, “presents new requirements: getting the right information to the consumer quickly, ensuring reliability of external data you don’t have control over, validating the relationships among data elements, looking for data synergies and gaps, creating provenance of the data you provide to others, spotting skewed and biased data. In reality, a data scientist job is 80% of a data quality engineer, and just 20% of a researcher, dreamer, and scientist.”
This aligns with Steve Lohr of The New York Times reporting that data scientists are more often data janitors since they spend from 50 percent to 80 percent of their time mired in the more mundane labor of collecting and preparing unruly big data before it can be mined to discover the useful nuggets that provide business insights.
“As the amount and type of raw data sources increases exponentially,” Stefan Groschupf blogged, “data quality issues can wreak havoc on an organization. Data quality has become an important, if sometimes overlooked, piece of the big data equation. Until companies rethink their big data analytics workflow and ensure that data quality is considered at every step of the process—from integration all the way through to the final visualization—the benefits of big data will only be partly realized.”
So no matter what you heard or hoped, the truth is big data needs data quality too.
The third of the five biggest data myths debunked by Gartner is that big data technology will eliminate the need for data integration. The truth is big data technology excels at data acquisition, not data integration.
This myth is rooted in what Gartner referred to as the schema on read approach used by big data technology to quickly acquire a variety of data from sources with multiple data formats.
This is best exemplified by the Hadoop Distributed File System (HDFS). Unlike the predefined, and therefore predictably structured, data formats required by relational databases, HDFS is schema-less. It just stores data files, and those data files can be in just about any format. Gartner explained that “many people believe this flexibility will enable end users to determine how to interpret any data asset on demand. It will also, they believe, provide data access tailored to individual users.”
While it was a great innovation to make data acquisition schema-less, more work has to be done to develop information because, as Gartner explained, “most information users rely significantly on schema on write scenarios in which data is described, content is prescribed, and there is agreement about the integrity of data and how it relates to the scenarios.”
It has always been true that whenever you acquire data in various formats, it has to be transformed into a common format before it can be further processed and put to use. After schema on read and before schema on write is the schema in between.
Data integration is the schema in between. It always has been. Big data technology has not changed this because, as I have previously blogged, data stored in HDFS is not automatically integrated. And it’s not just Hadoop. Data integration is not a natural by-product of any big data technology, which is one of the reasons why technology is only one aspect of a big data solution.
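The “schema in between” can be sketched in a few lines. Assume, purely for illustration, that two acquired sources describe the same customer in different formats; schema-on-read stores them as-is, and the integration step imposes a common schema before the data can be used. The field names and `integrate` function below are hypothetical.

```python
import json

# Acquired as-is (schema on read): the same entity in two source formats.
source_a = json.loads('{"cust_id": 7, "full_name": "Ada Lovelace"}')
source_b = {"id": "7", "first": "Ada", "last": "Lovelace"}  # e.g. a parsed CSV row

def integrate(record):
    """The 'schema in between': map each source format to one common schema."""
    if "cust_id" in record:
        return {"id": int(record["cust_id"]), "name": record["full_name"]}
    return {"id": int(record["id"]), "name": f'{record["first"]} {record["last"]}'}

records = [integrate(r) for r in (source_a, source_b)]
print(records[0] == records[1])  # both sources now agree on structure and content
```

Storing both source files was trivial; it is this transformation step, which no big data technology performs automatically, that turns acquired data into usable information.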
Just as it has always been, in between data acquisition and data usage there’s a lot that has to happen. Not just data integration, but data quality and data governance too. Big data technology doesn’t magically make any of these things happen. In fact, big data just makes us even more painfully aware there’s no magic behind data management’s curtain, just a lot of hard work.
In previous posts on this blog, I have discussed aspects of data quality using travel reviews, the baseball strike zone, Christmas songs, the wind chill factor, and the bell curve.
Grab a big popcorn and a giant soda since this post discusses data quality using movie ratings.
The Following Post has been Approved for All Audiences
In the United States prior to 1922, every state and many cities had censorship boards that prevented movies from being shown in local theaters on the basis of “immoral” content. What exactly was immoral varied by censorship board.
In 1922, the Motion Picture Association of America (MPAA), representing all the major American movie studios, was formed in part to encourage the movie industry to censor itself. Will Hays, the first MPAA president, helped develop what came to be called the Hays Code, which used a list of criteria to rate movies as either moral or immoral.
For three decades, if a movie failed the Hays Code most movie theaters around the country would not show it. After World War II ended, however, views on movie morality began to change. Frank Sinatra received an Oscar nomination for his role as a heroin addict in the 1955 drama The Man with the Golden Arm. Jack Lemmon received an Oscar nomination for his role as a cross-dressing musician in the 1959 comedy Some Like It Hot. Both movies failed the Hays Code, but were booked by movie theaters based on good reviews and became big box office hits.
Then in 1968, a landmark Supreme Court decision (Ginsberg v. New York) ruled that states could “adjust the definition of obscenity as applied to minors.” Fearing the revival of local censorship boards, the MPAA created a new rating system intended to help parents protect their children from obscene material. Even though the ratings carried no legal authority, parents were encouraged to use the system as a guide in deciding what movies their children should see.
While a few changes have occurred over the years (most notably adding PG-13 in 1984), these are the same movie ratings we know today: G (General Audiences, All Ages Admitted), PG (Parental Guidance Suggested, Some Material may not be Suitable for Children), PG-13 (Parents Strongly Cautioned, Some Material may be Inappropriate for Children under 13), R (Restricted, Under 17 requires Accompanying Parent or Adult Guardian), and NC-17 (Adults Only, No One 17 and Under Allowed). For more on these ratings and how they are assigned, read this article by Dave Roos.
What Movie Ratings teach us about Data Quality
Just like the MPAA learned with the failure of the Hays Code to rate movies as either moral or immoral, data quality cannot simply be rated as good or bad. Perspectives about the quality standards for data, like the moral standards for movies, change over time. For example, consider how big data challenges traditional data quality standards.
Furthermore, good and bad, like moral and immoral, are ambiguous. A specific context is required to help interpret any rating. For the MPAA, the specific context became rating movies based on obscenity from the perspective of parents. For data quality, the specific context is based on fitness for the purpose of use from the perspective of business users.
Adding context, however, does not guarantee everyone will agree with the rating. Debates rage over the rating given to a particular movie. Likewise, many within your organization will disagree with the quality assessment of certain data. So the next time your organization calls a meeting to discuss its data quality standards, you might want to grab a big popcorn and a giant soda since the discussion could be as long and dramatic as a Peter Jackson trilogy.
When big data is discussed, we rightfully hear a lot of concerns about signal-to-noise ratio. Amidst the rapidly expanding galaxy of information surrounding us every day, most of what we hear sounds like meaningless background static. We are so overloaded but underwhelmed by information, perhaps we are becoming tone deaf to signal, leaving us hearing only noise.
Before big data burst onto the scene, bursting our eardrums with its unstructured cacophony, most of us believed we had data quality standards and therefore we knew good information when we heard it. Just as most of us believe we would know great music when we hear it.
In 2007, Joshua Bell, a world-class violinist and recipient of that year’s Avery Fisher Prize, an award given to American musicians for outstanding achievement in classical music, participated in an experiment for The Washington Post columnist Gene Weingarten, whose article about the experiment won the 2008 Pulitzer Prize for feature writing.
During the experiment, Bell posed as a street performer dressed in jeans and a baseball cap and played violin for tips in a Washington, D.C. metro station during one Friday morning rush hour. For 45 minutes he performed six classical music masterpieces, some of the most elegant music ever written, on one of the most valuable violins ever made.
Weingarten described it as “an experiment in context, perception, and priorities—as well as an unblinking assessment of public taste: In a banal setting at an inconvenient time, would beauty transcend? Each passerby had a quick choice to make, one familiar to commuters in any urban area where the occasional street performer is part of the cityscape: Do you stop and listen? Do you hurry past with a blend of guilt and irritation, aware of your cupidity but annoyed by the unbidden demand on your time and your wallet? Do you throw in a buck, just to be polite? Does your decision change if he’s really bad? What if he’s really good? Do you have time for beauty?”
As Bell performed, over 1,000 commuters passed by him. Most barely noticed him at all. Very few stopped briefly to listen to him play. Even fewer were impressed enough to toss a little money into his violin case. Although three days earlier he had performed the same set in front of a sold out crowd at Boston Symphony Hall where ticket prices for good seats started at $100 each, at the metro station Bell earned only $32.17 in tips.
“If a great musician plays great music but no one hears,” Weingarten pondered, “was he really any good?”
That metro station during rush hour is an apt metaphor for the daily experiment in context, perception, and priorities facing information development in the big data era. In a fast-paced business world overcrowded with fellow travelers and abuzz with so much background noise, if great information is playing but no one hears, is it really any good?
Do you know good information when you hear it? How do you hear it amidst the noise surrounding it?
What is more important, the process or its outcome? Information management processes, like those described by the MIKE2.0 Methodology, drive the daily operations of an organization’s business functions as well as support the tactical and strategic decision-making processes of its business leaders. However, an organization’s success or failure is usually measured by the outcomes produced by those processes.
As Duncan Watts explained in his book Everything Is Obvious: How Common Sense Fails Us, “rather than the evaluation of the outcome being determined by the quality of the process that led to it, it is the observed nature of the outcome that determines how we evaluate the process.” This is known as outcome bias.
While an organization is enjoying positive outcomes, such as exceeding its revenue goals for the current fiscal period, outcome bias bathes its processes in a rose-colored glow. Information management processes must be providing high-quality data to decision-making processes, which business leaders are using to make good decisions. However, when an organization is suffering from negative outcomes, such as a regulatory compliance failure, outcome bias blames it on broken information management processes and poor data quality that lead to bad decision-making.
“Judging the merit of a decision can never be done simply by looking at the outcome,” explained Jeffrey Ma in his book The House Advantage: Playing the Odds to Win Big In Business. “A poor result does not necessarily mean a poor decision. Likewise a good result does not necessarily mean a good decision.”
“We are prone to blame decision makers for good decisions that worked out badly and to give them too little credit for successful moves that appear obvious after the fact,” explained Daniel Kahneman in his book Thinking, Fast and Slow.
While risk mitigation is an oft-cited business justification for investing in information management, Kahneman also noted how outcome bias can “bring undeserved rewards to irresponsible risk seekers, such as a general or an entrepreneur who took a crazy gamble and won. Leaders who have been lucky are never punished for having taken too much risk. Instead, they are believed to have had the flair and foresight to anticipate success. A few lucky gambles can crown a reckless leader with a halo of prescience and boldness.”
Outcome bias triggers overreactions to both success and failure. Organizations that try to reverse engineer a single, successful outcome into a formal, repeatable process often fail, much to their surprise. Organizations also tend to abandon a new process immediately if its first outcome is a failure. “Over time,” Ma explained, “if one makes good, quality decisions, one will generally receive better outcomes, but it takes a large sample set to prove this.”
Your organization needs solid processes governing how information is created, managed, presented, and used in decision-making. Your organization also needs to guard against outcomes biasing your evaluation of those processes.
In order to overcome outcome bias, Watts recommended we “bear in mind that a good plan can fail while a bad plan can succeed—just by random chance—and therefore judge the plan on its own merits as well as the known outcome.”
I have written about several aspects of the Internet of Things (IoT) in previous posts on this blog (The Internet of Humans, The Quality of Things, and Is it finally the Year of IoT?). Two recent articles have examined a few of the things that are interrupting the progress of IoT.
In his InformationWeek article Internet of Things: What’s Holding Us Back, Chris Murphy reported that while “companies in a variety of industries — transportation, energy, heavy equipment, consumer goods, healthcare, hospitality, insurance — are getting measurable results by analyzing data collected from all manner of machines, equipment, devices, appliances, and other networked things” (of which his article notes many examples), there are things slowing the progress of IoT.
One such myth, Murphy explained, “is that companies have all the data they need, but their real challenge is making sense of it. In reality, the cost of collecting some kinds of data remains too high, the quality of the data isn’t always good enough, and it remains difficult to integrate multiple data sources.”
The fact is a lot of the things we want to connect to IoT weren’t built for Internet connectivity, so it’s not as simple as just sticking a sensor on everything. Poor data quality, especially timeliness, but also completeness and accuracy, is impacting the usefulness of the data transmitted by things currently connected to sensors. Other issues Murphy explores in his article include the lack of wireless network ubiquity, data integration complexities, and securing IoT data centers and devices against Internet threats.
“The clearest IoT successes today,” Murphy concluded, “are from industrial projects that save companies money, rather than from projects that drive new revenue. But even with these industrial projects, companies shouldn’t underestimate the cultural change they need to manage as machines start telling veteran machine operators, train drivers, nurses, and mechanics what’s wrong, and what they should do, in their environment.”
In his Wired article Why Tech’s Best Minds Are Very Worried About the Internet of Things, Klint Finley reported on the findings of a survey about IoT from the Pew Research Center. While potential benefits of IoT were noted, such as medical devices monitoring health and environmental sensors detecting pollution, potential risks were highlighted, security being one of the most immediate concerns. “Beyond security concerns,” Finley explained, “there’s the threat of building a world that may be too complex for our own good. If you think error messages and applications crashes are a problem now, just wait until the web is embedded in everything from your car to your sneakers. Like the VCR that forever blinks 12:00, many of the technologies built into the devices of the future may never be used properly.”
The research also noted concerns about IoT expanding the digital divide, ostracizing those who cannot, or choose not to, participate. Data privacy concerns were also raised, including the possibility of dehumanizing the workplace by monitoring employees through wearable devices (e.g., your employee badge being used to track your movements).
Some survey respondents hinted at a division similar to what we hear about the cloud, differentiating a public IoT from a private IoT. In the latter case, instead of sacrificing our privacy for the advantages of connected devices, there’s no reason our devices can’t connect to a private network instead of the public internet. Other survey respondents believe that IoT has been overhyped as the next big thing and while it will be useful in some areas (e.g., the military, hospitals, and prisons) it will not pervade mainstream culture and therefore will not invade our privacy as many fear.
What Say You about IoT?
Please share your viewpoint about IoT by leaving a comment below.
While data visualization often produces pretty pictures, as Phil Simon explained in his new book The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions, “data visualization should not be confused with art. Clarity, utility, and user-friendliness are paramount to any design aesthetic.” Bad data visualizations are even worse than bad art since, as Simon says, “they confuse people more than they convey information.”
Simon explained how data scientist Melinda Thielbar recommends using data visualization to help an analyst communicate with a nontechnical audience, as well as help the data communicate with the analyst.
“Visualization is a great way to let the data tell a story,” Thielbar explained. “It’s also a great way for analysts to fool themselves into believing the story they want to believe.” This is why she recommends developing the visualizations at the beginning of the analysis to allow the visualizations that really illustrate the story behind the data to stand out, a process she calls “building windows into the data.” When you look through a window, you may not like what you see.
“Data visualizations may include bad, suspect, duplicate, or incomplete data,” Simon explained. This can be a good thing, however, since data visualizations “can help users identify fishy information and purify data faster than manual hunting and pecking. Data quality is a continuum, not a binary. Use data visualization to improve data quality.” Even when you are looking at what appears to be the pretty end of the continuum, Simon cautioned that “just because data is visualized doesn’t necessarily mean that it is accurate, complete, or indicative of the right course of action.”
Especially when dealing with the volume aspect of big data, data visualization can help find outliers faster. While detailed analysis is needed to determine whether an outlier is a business insight or a data quality issue, data visualization can help you shake those needles out of the haystack and into a clear field of vision.
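A plotted series makes an out-of-range value jump out instantly; the minimal sketch below approximates that visual “that looks wrong” judgment with a simple z-score rule. The data and the two-standard-deviation threshold are hypothetical stand-ins, not a prescribed method.

```python
import statistics

# Hypothetical metric values; one record is wildly out of range.
values = [102, 98, 101, 97, 103, 99, 100, 9750]

mean = statistics.mean(values)
stdev = statistics.pstdev(values)

# On a scatter plot the 9750 would be impossible to miss; programmatically,
# flagging anything more than two standard deviations from the mean
# surfaces the same candidates for review.
outliers = [v for v in values if abs(v - mean) > 2 * stdev]
print(outliers)  # candidates for review: business insight or data quality issue?
```

Whether the flagged value is a genuine needle or a fishy data entry still requires the detailed analysis the paragraph above describes; the visualization (or its programmatic proxy) only tells you where to look.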
Among its many other uses, which Simon illustrates well in his book, finding ugly data with pretty pictures is one way data visualization can be used for improving data quality.
A few weeks ago, while reading about the winners at the 56th Annual Grammy Awards, I saw that Daft Punk won both Record of the Year and Album of the Year, which made me wonder what the difference is between a record and an album.
Then I read that Record of the Year is awarded to the performer and the production team of a single song. While Daft Punk won Record of the Year for their song “Get Lucky”, the song was not lucky enough to win Song of the Year (that award went to Lorde for her song “Royals”).
My confusion about the semantics of the Grammy Awards prompted a quick trip to Wikipedia, where I learned that Record of the Year is awarded for either a single or individual track from an album. This award goes to the performer and the production team for that one song. In this context, record means a particular recorded song, not its composition or an album of songs.
Although Song of the Year is also awarded for a single or individual track from an album, the recipient of this award is the songwriter who wrote the lyrics to the song. In this context, song means the song as composed, not its recording.
The Least Ambiguous Award goes to Album of the Year, which is indeed awarded for a whole album. This award goes to the performer and the production team for that album. In this context, album means a recorded collection of songs, not the individual songs or their compositions.
These distinctions, and the confusion they caused me, seemed eerily reminiscent of the challenges that happen within organizations when data is ambiguously defined. For example, terms like customer and revenue are frequently used without definition or context. When data definitions are ambiguous, it can easily lead to incorrect uses of data as well as confusing references to data during business discussions.
Not only is it difficult to reach consensus on data definitions, but definitions also change over time. For example, Record of the Year used to be awarded to only the performer, not the production team. And the definition of who exactly counts as a member of the production team has been changed four times over the years, most recently in 2013.
Avoiding semantic inconsistencies, such as the difference between a baker and a Baker, is an important aspect of metadata management. Be diligent with your data definitions and avoid daft definitions for sound semantics.