Posts Tagged ‘open data’
In a previous post, I discussed some data quality and data governance issues associated with open data. In his recent blog post How far can we trust open data?, Owen Boswarva raised several more good points on the subject.
“The trustworthiness of open data,” Boswarva explained, “depends on the particulars of the individual dataset and publisher. Some open data is robust, and some is rubbish. That doesn’t mean there’s anything wrong with open data as a concept. The same broad statement can be made about data that is available only on commercial terms. But there is a risk attached to open data that does not usually attach to commercial data.”
Data quality, third-party rights, and personal data were three grey areas Boswarva discussed. Although his post focused on a specific open dataset published by an agency of the government of the United Kingdom (UK), his points are generally applicable to all open data.
As Boswarva remarked, the quality of a lot of open data is high even though there is little incentive to incur the financial cost of verifying the quality of data being given away for free. The "publish early even if imperfect" principle also encourages a laxer data quality standard for open data. However, "the silver lining for quality-assurance of open data," Boswarva explained, is that "open licenses maximize re-use, which means more users and re-users, which increases the likelihood that errors will be detected and reported back to the publisher."
The issue of third-party rights raised by Boswarva was one I had never considered. His example was the use of a paid third-party provider to validate and enrich postal address data before its release as part of an open dataset, so consumers of the open dataset benefit from postal validation and enrichment without paying for it. In this case, the UK third-party providers acquiesced to open re-use of their derived data because their rights were made clear to re-users (i.e., open data consumers). However, Boswarva pointed out that using open data doesn't provide any protection from third-party liability and, more importantly, doesn't create any obligation on open data publishers to make sure re-users are aware of any such potential liability. Although this is a UK example, that caution should be considered applicable to open data in all countries.
As for personal data, Boswarva noted that while open datasets are almost invariably non-personal data, "publishers may not realize that their datasets contain personal data, or that analysis of a public release can expose information about individuals." The example in his post centered on the postal addresses of property owners, which, without the names of the owners included in the dataset, are not technically personal data. However, it is easy to cross-reference this with other open datasets to assemble a lot of personally identifiable information that, if it were contained in one dataset, would be considered a data protection violation (at least in the UK).
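The re-identification risk Boswarva describes is easy to demonstrate. The sketch below uses made-up data and hypothetical dataset fields: two datasets that look harmless on their own are joined on a shared postal address to link a named person to property details neither dataset published together.

```python
# Two hypothetical open datasets, each non-personal on its own:
# one lists property attributes by address, the other lists
# business registrations with a director's name and address.
property_data = [
    {"address": "12 High St", "rooms": 4, "sale_price": 350000},
    {"address": "7 Mill Lane", "rooms": 2, "sale_price": 180000},
]

company_data = [
    {"director": "A. Smith", "address": "12 High St"},
]

# Cross-referencing on the shared address assembles a profile
# that neither publisher released as a single dataset.
profiles = {}
for prop in property_data:
    profiles[prop["address"]] = dict(prop)
for record in company_data:
    if record["address"] in profiles:
        profiles[record["address"]]["director"] = record["director"]

# Keep only the addresses that were linked to a named individual.
linked = [p for p in profiles.values() if "director" in p]
print(linked)
```

The join itself is trivial; that is precisely the point. The difficulty of this kind of linkage is no deterrent at all.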
Calls for increased transparency and accountability have led government agencies around the world to make more information available to the public as open data. As more people have accessed this information, it has quickly become apparent that data quality and data governance issues complicate putting open data to use.
“It’s an open secret,” Joel Gurin wrote, “that a lot of government data is incomplete, inaccurate, or almost unusable. Some agencies, for instance, have pervasive problems in the geographic data they collect: if you try to map the factories the EPA regulates, you’ll see several pop up in China, the Pacific Ocean, or the middle of Boston Harbor.”
A common reason for such data quality issues in the United States government’s data is what David Weinberger wrote about Data.gov. “The keepers of the site did not commit themselves to carefully checking all the data before it went live. Nor did they require agencies to come up with well-formulated standards for expressing that data. Instead, it was all just shoveled into the site. Had the site keepers insisted on curating the data, deleting that which was unreliable or judged to be of little value, Data.gov would have become one of those projects that each administration kicks further down the road and never gets done.”
Of course, the United States is not alone in either making government data open (about 60 countries have joined the Open Government Partnership) or having it reveal data quality issues. Victoria Lemieux recently blogged about data issues hindering the United Kingdom government’s Open Data program in her post Why we’re failing to get the most out of open data.
One of the data governance issues Lemieux highlighted was data provenance. "Knowing where data originates and by what means it has been disclosed," Lemieux explained, "is key to being able to trust data. If end users do not trust data, they are unlikely to believe they can rely upon the information for accountability purposes." Lemieux explained that determining data provenance can be difficult since "it entails a good deal of effort undertaking such activities as enriching data with metadata, such as the date of creation, the creator of the data, who has had access to the data over time. Full comprehension of data relies on the ability to trace its origins. Without knowledge of data provenance, it can be difficult to interpret the meaning of terms, acronyms, and measures that data creators may have taken for granted, but are much more difficult to decipher over time."
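Lemieux's point about enriching data with provenance metadata can be made concrete. A minimal sketch follows; the field names and values are illustrative, not any published metadata standard.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Provenance:
    """Minimal provenance metadata of the kind Lemieux describes."""
    created: str                 # date of creation
    creator: str                 # who produced the data
    access_log: List[str] = field(default_factory=list)  # who has had access over time

@dataclass
class DatasetRecord:
    name: str
    provenance: Provenance

    def record_access(self, user: str) -> None:
        """Append to the access history so provenance stays traceable."""
        self.provenance.access_log.append(user)

# A hypothetical open dataset enriched with provenance metadata.
record = DatasetRecord(
    name="property_prices",
    provenance=Provenance(created="2014-06-01", creator="publishing agency"),
)
record.record_access("analyst@example.org")
print(record.provenance.access_log)
```

The effort Lemieux mentions is not in the data structure, which is simple, but in the discipline of populating it at every step of a dataset's life.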
I think the bad press about open data is a good thing because open data is opening eyes to two basic facts about all data. One, whenever data is made available for review, you will discover data quality issues. Two, whenever data quality issues are discovered, you will need data governance to resolve them. Therefore, the reason we’re failing to get the most out of open data is the same reason we fail to get the most out of any data.
In his book Open Data Now: The Secret to Hot Startups, Smart Investing, Savvy Marketing, and Fast Innovation, Joel Gurin explained a type of Open Data called Smart Disclosure, which was defined as “the timely release of complex information and data in standardized, machine-readable formats in ways that enable consumers to make informed decisions.”
As Gurin explained, “Smart Disclosure combines government data, company information about products and services, and data about an individual’s own needs to help consumers make personalized decisions. Since few people are database experts, most will use this Open Data through an intermediary—a choice engine that integrates the data and helps people filter it by what’s important to them, much the way travel sites do for airline and hotel booking. These choice engines can tailor the options to fit an individual’s circumstances, budget, and priorities.”
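A choice engine of the kind Gurin describes is, at its core, a filter-and-rank over integrated data. A toy sketch, with made-up plans and constraints, shows the idea:

```python
# Hypothetical integrated data: company product information,
# here mobile phone plans, combined into one list.
plans = [
    {"name": "Basic",   "price": 20, "data_gb": 2,  "rating": 3.5},
    {"name": "Family",  "price": 45, "data_gb": 20, "rating": 4.2},
    {"name": "Premium", "price": 70, "data_gb": 50, "rating": 4.8},
]

def choice_engine(plans, budget, min_data_gb):
    """Filter by the consumer's constraints, then rank by rating."""
    eligible = [p for p in plans
                if p["price"] <= budget and p["data_gb"] >= min_data_gb]
    return sorted(eligible, key=lambda p: p["rating"], reverse=True)

# A consumer with a $50 budget who needs at least 10 GB of data:
print(choice_engine(plans, budget=50, min_data_gb=10))
```

Travel sites do essentially this over flight and hotel inventories; Smart Disclosure extends the same pattern to any decision where the underlying data is released in machine-readable form.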
Remember (if you are old enough) what it was like to make travel arrangements before websites like Expedia, Orbitz, Travelocity, Priceline, and Kayak existed, and you can imagine the immense consumer-driven business potential for applying Smart Disclosure and choice engines to every type of consumer decision.
“Smart Disclosure works best,” Gurin explained, “when it brings together data about the services a company offers with data about the individual consumer. Smart Disclosure includes giving consumers data about themselves—such as their medical records, cellphone charges, or patterns of energy use—so they can choose the products and services uniquely suited to their needs. This is Open Data in a special sense: it’s open only to the individual whom the data is about and has to be released to each person under secure conditions by the company or government agency that holds the data. It’s essential that these organizations take special care to be sure the data is not seen by anyone else. Many people may balk at the idea of having their personal data released in a digital form. But if the data is kept private and secure, giving personal data back to individuals is one of the most powerful aspects of Smart Disclosure.”
Although it sounds like a paradox, the best way to secure our personal data may be to make it open. Currently most of our own personal data is closed—especially to us, which is the real paradox.
Some of our personal data is claimed as proprietary information by the companies we do business with. Data about our health is cloaked by government regulations intended to protect it, but which mostly protect doctors from getting sued while giving medical service providers and health insurance companies more access to our medical history than we have.
If all of our personal data was open to us, and we controlled the authorization of secure access to it, our personal data would be both open and secure. This would simultaneously protect our privacy and improve our choice as consumers.
In his book Open Data Now: The Secret to Hot Startups, Smart Investing, Savvy Marketing, and Fast Innovation, Joel Gurin explained that Open Data and Big Data are related but very different.
While various definitions exist, Gurin noted that “all definitions of Open Data include two basic features: the data must be publicly available for anyone to use, and it must be licensed in a way that allows for its reuse. Open Data should also be in a form that makes it relatively easy to use and analyze, although there are gradations of openness. And there’s general agreement that Open Data should be free of charge or cost just a minimal amount.”
“Big Data involves processing very large datasets to identify patterns and connections in the data,” Gurin explained. “It’s made possible by the incredible amount of data that is generated, accumulated, and analyzed every day with the help of ever-increasing computer power and ever-cheaper data storage. It uses the data exhaust that all of us leave behind through our daily lives. Our mobile phones’ GPS systems report back on our location as we drive; credit card purchase records show what we buy and where; Google searches are tracked; smart meters in our homes record our energy usage. All are grist for the Big Data mill.”
Private and Passive versus Public and Purposeful
Gurin explained that Big Data tends to be private and passive, whereas Open Data tends to be public and purposeful.
“Big Data usually comes from sources that passively generate data without purpose, without direction, or without even realizing that they’re creating it. And the companies and organizations that use Big Data usually keep the data private for business or security reasons. This includes the data that large retailers hold on customers’ buying habits, that hospitals hold about their patients, and that banks hold about their credit card holders.”
By contrast, Open Data “is consciously released in a way that anyone can access, analyze, and use as he or she sees fit. Open Data is also often released with a specific purpose in mind—whether the goal is to spur research and development, fuel new businesses, improve public health and safety, or achieve any number of other objectives.”
“While Big Data and Open Data each have important commercial uses, they are very different in philosophy, goals, and practice. For example, large companies may use Big Data to analyze customer databases and target their marketing to individual customers, while they use Open Data for market intelligence and brand building.”
Big and Open Data
Gurin also noted, however, that some of the most powerful results arise when Big Data and Open Data overlap.
“Some government agencies have made very large amounts of data open with major economic benefits. National weather data and GPS data are the most often-cited examples. U.S. census data and data collected by the Securities and Exchange Commission and the Department of Health and Human Services are others. And nongovernmental research has produced large amounts of data, particularly in biomedicine, that is now being shared openly to accelerate the pace of scientific discovery.”
Data Open for Business
Gurin addressed the apparent paradox of Open Data: “If Open Data is free, how can anyone build a business on it? The answer is that Open Data is the starting point, not the endpoint, in deriving value from information.” For example, even though weather and GPS data have been available for decades, those same Open Data starting points continue to spark new ideas, generating new, and profitable, endpoints.
While data privacy still requires sensitive data not be shared without consent and competitive differentiation still requires an organization’s intellectual property not be shared, that still leaves a vast amount of other data which, if made available as Open Data, will make more data open for business.
Few computing and technological achievements rival IBM’s Watson. Its most impressive accomplishment to this point is its high-profile victory on Jeopardy!, following in the footsteps of Deep Blue’s victory in chess.
Turns out that we ain’t seen nothin’ yet. Its next incarnation will be much more developer-friendly. From a recent GigaOM piece:
Developers who want to incorporate Watson’s ability to understand natural language and provide answers need only have their applications make a REST API call to IBM’s new Watson Developers Cloud. “It doesn’t require that you understand anything about machine learning other than the need to provide training data,” Rob High, IBM’s CTO for Watson, said in a recent interview about the new platform.
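In practice, "make a REST API call" means sending an HTTP request with a JSON payload. The sketch below only constructs such a request using Python's standard library; the endpoint URL and payload fields are hypothetical placeholders, not IBM's actual Watson Developers Cloud API.

```python
import json
import urllib.request

# Hypothetical endpoint; the real Watson Developers Cloud defines
# its own URLs, payload schema, and authentication scheme.
ENDPOINT = "https://example.com/watson/v1/question"

def build_question_request(question_text, api_key):
    """Construct (but do not send) a POST request carrying a question."""
    payload = json.dumps({"question": question_text}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_question_request("What is the capital of France?", "my-api-key")
print(req.full_url, req.get_method())
# Sending it would be one more line: urllib.request.urlopen(req)
```

That is the whole point High is making: the machine-learning complexity stays on IBM's side of the API, and the developer's job reduces to shaping requests like this one and supplying training data.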
The rationale to embrace platform thinking is as follows: As impressive as Watson is, even an organization as large as IBM (with over 400,000 employees) does not hold a monopoly on smart people. Platforms and ecosystems can take Watson in myriad directions, many of which you and I can’t even anticipate. Innovation is externalized to some extent. (If you’re a developer curious to get started, knock yourself out.)
Continue reading the article and you’ll see that Watson 2.0 “ships” not only with an API, but an SDK, an app store, and a data marketplace. That is, the more data Watson has, the more it can learn. Can someone say network effect?
Think about it for a minute. A data marketplace? Really? Doesn’t information really want to be free?
Well, yes and no. There’s no dearth of open data on the Internet, a trend that shows no signs of abating. But let’s not overdo it. The success of Kaggle has shown that thousands of organizations are willing to pay handsomely for data that solves important business problems, especially if that data is timely, accurate, and aggregated well. As a result, data marketplaces are becoming increasingly important and profitable.
Simon Says: Embrace Data and Platform Thinking
The market for data is nothing short of vibrant. Big Data has arrived, but not all data is open, public, free, and usable.
Combine the explosion of data with platform thinking. It’s not just about the smart cookies who work for you. There’s no shortage of ways to embrace platforms and ecosystems, even if you’re a mature company. Don’t just look inside your organization’s walls for answers to vexing questions. Look outside. You just might be amazed at what you’ll find.
What say you?