Posts Tagged ‘data governance’
In a previous post, I discussed some data quality and data governance issues associated with open data. In his recent blog post How far can we trust open data?, Owen Boswarva raised several good points about open data.
“The trustworthiness of open data,” Boswarva explained, “depends on the particulars of the individual dataset and publisher. Some open data is robust, and some is rubbish. That doesn’t mean there’s anything wrong with open data as a concept. The same broad statement can be made about data that is available only on commercial terms. But there is a risk attached to open data that does not usually attach to commercial data.”
Data quality, third-party rights, and personal data were three grey areas Boswarva discussed. Although his post focused on a specific open dataset published by an agency of the government of the United Kingdom (UK), his points are generally applicable to all open data.
As Boswarva remarked, the quality of a lot of open data is high even though there is no motivation to incur the financial cost of verifying the quality of data being given away for free. The “publish early even if imperfect” principle also encourages a laxer data quality standard for open data. However, “the silver lining for quality-assurance of open data,” Boswarva explained, is that “open licenses maximize re-use, which means more users and re-users, which increases the likelihood that errors will be detected and reported back to the publisher.”
The issue of third-party rights raised by Boswarva was one that I had never considered. His example was the use of a paid third-party provider to validate and enrich postal address data before it is released as part of an open dataset. Therefore, consumers of the open dataset benefit from postal validation and enrichment without paying for it. While the UK third-party providers in this example acquiesced to open re-use of their derived data because their rights were made clear to re-users (i.e., open data consumers), Boswarva pointed out that re-users should be aware that using open data doesn’t provide any protection from third-party liability and, more importantly, doesn’t create any obligation on open data publishers to make sure re-users are aware of any such potential liability. While, again, this is a UK example, that caution should be considered applicable to all open data in all countries.
As for personal data, Boswarva noted that while open datasets are almost invariably non-personal data, “publishers may not realize that their datasets contain personal data, or that analysis of a public release can expose information about individuals.” The example in his post centered on the postal addresses of property owners, which, without the names of the owners included in the dataset, are not technically personal data. However, it is easy to cross-reference these addresses with other open datasets to assemble a lot of personally identifiable information that, if it were contained in one dataset, would be considered a data protection violation (at least in the UK).
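To make the cross-referencing risk concrete, here is a minimal Python sketch. Both datasets, their field names, and every value in them are invented for illustration; none of it comes from Boswarva's post or from any real open data release.

```python
# Hypothetical open dataset A: property records published without owner names.
property_records = [
    {"address": "1 Example Street", "sale_price": 250000},
    {"address": "2 Example Street", "sale_price": 310000},
]

# Hypothetical open dataset B: a separate public register listing names by address.
public_register = [
    {"address": "1 Example Street", "name": "A. Resident"},
]


def cross_reference(records, register):
    """Join two datasets on address, attaching a name wherever a match exists."""
    names_by_address = {row["address"]: row["name"] for row in register}
    linked = []
    for record in records:
        name = names_by_address.get(record["address"])
        if name is not None:
            # The combined row now ties a named individual to a property record.
            linked.append({**record, "name": name})
    return linked


linked_records = cross_reference(property_records, public_register)
print(linked_records)
```

Neither dataset contains personal data about property ownership on its own, yet a single dictionary lookup is enough to combine them into something that does.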
One of my favorite books is SuperFreakonomics by economist Steven Levitt and journalist Stephen Dubner, in which, as with their first book and podcast, they challenge conventional thinking on a variety of topics, often revealing counterintuitive insights about how the world works.
One of the many examples from the book is their analysis of the Endangered Species Act (ESA) passed by the United States in 1973 with the intention to protect critically imperiled species from extinction.
Levitt and Dubner argued the ESA could, in fact, be endangering more species than it protects. After a species is designated as endangered, the next step is to designate the geographic areas considered critical habitats for that species. After an initial set of boundaries is made, public hearings are held, allowing time for developers, environmentalists, and others to have their say. The process to finalize the critical habitats can take months or even years. This lag time creates a strong incentive for landowners within the initial geographic boundaries to act before their property is declared a critical habitat or out of concern that it could attract endangered species. Trees are cut down to make their land less hospitable or development projects are fast-tracked before ESA regulation would prevent them. This often has the unintended consequence of hastening the destruction of more critical habitats and expediting the extinction of more endangered species.
This made me wonder whether data governance could be endangering more data than it protects.
After a newly launched data governance program designates the data that must be governed, the next step is to define the policies and procedures that will have to be implemented. A series of meetings is held, allowing time for stakeholders across the organization to have their say. The process to finalize the policies and procedures can take weeks or even months. This lag time provides an opportunity for developing ways to work around data governance processes once they are in place, or ways to simply not report issues. Either way, this can create the facade that data is governed when, in fact, it remains endangered.
Just as it’s easy to make the argument that endangered species should be saved, it’s easy to make the argument that data should be governed. Success is a more difficult argument. While the ESA has listed over 2,000 endangered species, only 28 have been delisted due to recovery. That’s a success rate of only about one percent. While the success rate of data governance is hopefully higher, as Loraine Lawson recently blogged, a lot of people don’t know if their data governance program is on the right track or not. And that fact in itself might be endangering data more than not governing data at all.
Calls for increased transparency and accountability have led government agencies around the world to make more information available to the public as open data. As more people accessed this information, it quickly became apparent that data quality and data governance issues complicate putting open data to use.
“It’s an open secret,” Joel Gurin wrote, “that a lot of government data is incomplete, inaccurate, or almost unusable. Some agencies, for instance, have pervasive problems in the geographic data they collect: if you try to map the factories the EPA regulates, you’ll see several pop up in China, the Pacific Ocean, or the middle of Boston Harbor.”
A common reason for such data quality issues in the United States government’s data is what David Weinberger wrote about Data.gov. “The keepers of the site did not commit themselves to carefully checking all the data before it went live. Nor did they require agencies to come up with well-formulated standards for expressing that data. Instead, it was all just shoveled into the site. Had the site keepers insisted on curating the data, deleting that which was unreliable or judged to be of little value, Data.gov would have become one of those projects that each administration kicks further down the road and never gets done.”
Of course, the United States is not alone in either making government data open (about 60 countries have joined the Open Government Partnership) or having it reveal data quality issues. Victoria Lemieux recently blogged about data issues hindering the United Kingdom government’s Open Data program in her post Why we’re failing to get the most out of open data.
One of the data governance issues Lemieux highlighted was data provenance. “Knowing where data originates and by what means it has been disclosed,” Lemieux explained, “is key to being able to trust data. If end users do not trust data, they are unlikely to believe they can rely upon the information for accountability purposes.” Lemieux explained that determining data provenance can be difficult since “it entails a good deal of effort undertaking such activities as enriching data with metadata, such as the date of creation, the creator of the data, who has had access to the data over time. Full comprehension of data relies on the ability to trace its origins. Without knowledge of data provenance, it can be difficult to interpret the meaning of terms, acronyms, and measures that data creators may have taken for granted, but are much more difficult to decipher over time.”
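The kind of provenance metadata Lemieux lists (date of creation, creator, and who has had access over time) could be captured in a structure like the one sketched below. The class name, field names, and sample values are all hypothetical; this is one possible shape for such a record, not a standard or anything prescribed in her post.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance metadata attached to a published dataset."""
    dataset_name: str
    created_on: date
    creator: str
    access_log: list = field(default_factory=list)  # (accessor, date) pairs, in order

    def record_access(self, accessor: str, on: date) -> None:
        """Append an entry noting who accessed the data and when."""
        self.access_log.append((accessor, on))

    def missing_fields(self) -> list:
        """Return the names of provenance fields that are still empty."""
        missing = []
        if not self.creator:
            missing.append("creator")
        if not self.access_log:
            missing.append("access_log")
        return missing


# Invented example values, for illustration only.
record = ProvenanceRecord("example_dataset", date(2014, 1, 15), creator="")
print(record.missing_fields())  # both creator and access history are unknown

record.creator = "Example Agency"
record.record_access("Example Publisher", date(2014, 2, 1))
print(record.missing_fields())
```

Even a check this simple makes gaps in provenance visible before a dataset is published, rather than leaving re-users to discover them later.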
I think the bad press about open data is a good thing because open data is opening eyes to two basic facts about all data. One, whenever data is made available for review, you will discover data quality issues. Two, whenever data quality issues are discovered, you will need data governance to resolve them. Therefore, the reason we’re failing to get the most out of open data is the same reason we fail to get the most out of any data.
The third of the five biggest data myths debunked by Gartner is big data technology will eliminate the need for data integration. The truth is big data technology excels at data acquisition, not data integration.
This myth is rooted in what Gartner referred to as the schema on read approach used by big data technology to quickly acquire a variety of data from sources with multiple data formats.
This is best exemplified by the Hadoop Distributed File System (HDFS). Unlike the predefined, and therefore predictably structured, data formats required by relational databases, HDFS is schema-less. It just stores data files, and those data files can be in just about any format. Gartner explained that “many people believe this flexibility will enable end users to determine how to interpret any data asset on demand. It will also, they believe, provide data access tailored to individual users.”
While it was a great innovation to make data acquisition schema-less, more work has to be done to develop information because, as Gartner explained, “most information users rely significantly on schema on write scenarios in which data is described, content is prescribed, and there is agreement about the integrity of data and how it relates to the scenarios.”
It has always been true that whenever you acquire data in various formats, it has to be transformed into a common format before it can be further processed and put to use. After schema on read and before schema on write is the schema in between.
Data integration is the schema in between. It always has been. Big data technology has not changed this because, as I have previously blogged, data stored in HDFS is not automatically integrated. And it’s not just Hadoop. Data integration is not a natural by-product of any big data technology, which is one of the reasons why technology is only one aspect of a big data solution.
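As a small illustration of the schema in between, the Python sketch below takes two sources that were acquired as-is, one CSV and one JSON, and maps them into a common format. The source contents, the field names, and the target schema are all made up for this example; real integration work involves far more than renaming fields.

```python
import csv
import io
import json

# Two hypothetical sources, stored exactly as they arrived (schema on read).
csv_source = "cust_id,full_name\n1,Ada Lovelace\n"
json_source = '[{"customerId": 2, "name": "Alan Turing"}]'


def from_csv(text):
    """Map the CSV source's fields onto the common schema."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"customer_id": int(row["cust_id"]), "name": row["full_name"]}
        for row in reader
    ]


def from_json(text):
    """Map the JSON source's fields onto the common schema."""
    return [
        {"customer_id": int(item["customerId"]), "name": item["name"]}
        for item in json.loads(text)
    ]


# The schema in between: every record now has the same fields and types.
integrated = from_csv(csv_source) + from_json(json_source)
print(integrated)
```

Storing both files side by side did nothing to reconcile `cust_id` with `customerId`; someone still had to write the mapping.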
Just as it has always been, in between data acquisition and data usage there’s a lot that has to happen. Not just data integration, but data quality and data governance too. Big data technology doesn’t magically make any of these things happen. In fact, big data just makes us even more painfully aware there’s no magic behind data management’s curtain, just a lot of hard work.
A Facebook experiment from late 2012 made news earlier this year and raised the ethical question of whether, by using free services provided via the Internet and mobile apps, we have granted informed consent to be experimented on for whatever purposes.
The On the Media TLDR audio podcast recently posted an interview with Christian Rudder, the co-founder of the free dating website OkCupid, who recently blogged about how OkCupid experiments on its users in a post with the intentionally provocative title We Experiment On Human Beings!
While this revelation understandably attracted a lot of attention, at least OkCupid is not trying to hide what it’s doing. Furthermore, as Rudder blogged, “guess what, everybody: if you use the Internet, you’re the subject of hundreds of experiments at any given time, on every site. That’s how websites work.”
During the interview, Rudder made an interesting comparison between what websites like Facebook and OkCupid do and how psychologists and other social scientists have been experimenting on human beings for decades. This point resonated with me since I have read a lot of books that explore how and why humans behave and think the way we do. Just a few examples are Predictably Irrational by Dan Ariely, You Are Not So Smart by David McRaney, Nudge by Richard Thaler and Cass Sunstein, and Thinking, Fast and Slow by Daniel Kahneman.
Most of the insights discussed in these books are based on the results of countless experiments, most of which were performed on college students since they can be paid in pizza, course credit, or beer money. The majority of the time the subjects in these experiments are not fully informed about the nature of the experiment. In fact, many times they are intentionally misinformed in order to not skew the results of the experiment.
Rudder argued the same thing is done to improve websites. So why do we see hallowed halls when we envision the social scientists behind university research, but we see creepy cubicles when we envision the data scientists behind website experimentation? Perhaps we trust academic applications of science more than commercial ones.
During the interview, Rudder addressed the issue of trust. Users of OkCupid are trusting the service to provide them with good matches and Rudder acknowledged how experimenting on users can seem like a violation of that trust. “However,” Rudder argued, “doing experiments to make sure that what we’re recommending is the best job that we could possibly do is upholding that trust, not violating it.”
It’s easy to argue that the issue of informed consent regarding experimentation on a dating or social networking website is not the same as informed consent regarding government surveillance, such as last year’s PRISM scandal. The latter is less about experimentation and more about data privacy, where often we are our own worst enemy.
But who actually reads the terms and conditions for a website or mobile app? If you do not accept the terms and conditions, you can’t use it, so most of us accept them by default without bothering to read them. Technically, this constitutes informed consent, which is why it may simply be an outdated concept in the information age.
The information age needs enforced accountability (aka privacy through accountability), which is less about informed consent and more about holding service providers accountable for what they do with our data. This includes the data resulting from the experiments they perform on us. Transparency is an essential aspect of that accountability, allowing us to make an informed decision about what websites and mobile apps we want to use.
However, to Rudder’s point, we are fooling ourselves if we think that such transparency would allow us to avoid using the websites and mobile apps that experiment on us. They all do. They have to in order to be worth using.
“In a microsecond economy,” Becca Lipman recently blogged, “most data is only useful in the first few milliseconds, or to an extent, hours after it is created. But the way the industry is collecting data, or more accurately, hoarding it, you’d think its value lasts a lifetime. Yes, storage costs are going down and selecting data to delete is no easy task, especially for the unstructured and unclassified sets. And fear of deleting something that could one day be useful is always going to be a concern. But does this give firms the go-ahead to be data hoarders?”
Whether we choose to measure it in terabytes, petabytes, exabytes, HoardaBytes, or how harshly reality bites, we have been hoarding data long before the data management industry took a super-sized tip from McDonald’s and put the word “big” in front of its signature sandwich. At least McDonald’s started phasing out its super-sized menu options in 2004, stating the need to offer healthier food choices, a move that was perhaps motivated in some small way by the success of Morgan Spurlock’s Academy Award-nominated documentary film Super Size Me.
Much like fast food is an enabler for our chronic overeating and the growing epidemic of obesity, big data is an enabler for our data hoarding compulsion and the growing epidemic of data obesity.
Does this data smell bad to you?
Are there alternatives to data hoarding? Perhaps we could put an expiration date on data, after which we could at least archive, if not delete, the expired data. One challenge with this approach is that with most data you cannot say exactly when it will expire. Even if we could, however, expiration dates for data might be as meaningless as the expiration dates we currently have for food.
As Rose Eveleth reported, “these dates are—essentially—made up. Nobody regulates how long milk or cheese or bread stays good, so companies can essentially print whatever date they want on their products.” Eveleth shared links to many sources that basically recommended ignoring the dates and relying on seeing if the food looks or smells bad.
What about regulatory compliance?
Another enabler of data hoarding is concern about complying with future regulations. This is somewhat analogous to income tax preparation in the United States, where many people hoard boxes of receipts for everything in hopes of using them to itemize their tax deductions. Even though most of the contents are deemed irrelevant when filing an official tax return, some people still store the boxes in their attic just in case of a future tax audit.
How useful would data from this source be?
Although calculating the half-life of data has always been problematic, Larry Hardesty recently reported on a new algorithm developed by MIT graduate student Dan Levine and his advisor Jonathan How. By using the algorithm, Levine explained, “the usefulness of data can be assessed before the data itself becomes available.” Similar algorithms might be the best future alternative to data hoarding, especially when you realize that the “keep it just in case we need it” theory often later faces the “we can’t find it amongst all the stuff we kept” reality.
What Say You?
Have you found alternatives to data hoarding? Please share your thoughts by leaving a comment below.
During a podcast with Dr. Alexander Borek discussing his highly recommended new book Total Information Risk Management, he explained that “information is increasingly becoming an extremely valuable asset in organizations. The dependence on information increases as it becomes more valuable. As value and dependence increase, so does the likelihood of the risk that arises from not having the right information of the required quality for a business activity available at the right time.”
Borek referred to risk as the anti-value of information, explaining how the consequence of ineffective information management is poor data and information quality, which will lower business process performance and create operational, strategic, and opportunity risks in the business processes that are crucial to achieve an organization’s goals and objectives.
Information risk, however, as Borek explained, “also has a positive side to it: the opportunities that can be created. Your organization collects a lot of data every single day. Most of the data is stored in some kind of database and probably never used again. You should try to identify your hidden treasures in data and information. Getting it right can provide you with almost endless new opportunities, but getting it wrong not only makes you miss out on these opportunities, but also creates risks all over your business that prevent you from performing well.”
Since risk is the anti-value of information, Borek explained, “when you reduce risk, you create value, and you can use this value proposition to make the business case for information quality initiatives.”
While most business leaders will at least verbally acknowledge the value of information as an asset to the organization, few acknowledge the risk that negates the value of this asset when information management and governance is not a business priority. This means not just talking the talk about how information is an asset, but walking the walk by allocating the staffing and funding needed to truly manage and govern information as an asset—and mitigate the risk of information becoming a liability.
Whether your organization’s information maturity is aware, reactive, proactive, managed, or optimal, you must remain vigilant about information management and governance. If you need to assess your organization’s information maturity levels, check out the MIKE2.0 Information Maturity QuickScan.
The MIKE2.0 wiki defines the Chief Data Officer (CDO) as one who plays a key executive leadership role in driving data strategy, architecture, and governance as the executive leader for data management activities.
“Making the most of a company’s data requires oversight and evangelism at the highest levels of management,” Anthony Goldbloom and Merav Bloch explained in their Harvard Business Review blog post Your C-Suite Needs a Chief Data Officer.
Goldbloom and Bloch describe the CDO as being responsible for identifying how data can be used to support the company’s most important priorities, making sure the company is collecting the right data, and ensuring the company is wired to make data-driven decisions.
“I firmly believe the definition of a CDO role is a good idea,” Forrester analyst Gene Leganza blogged, but “there’s plenty to be worked out to make this effective. What would be the charter of this new role (and the organizational unit that would report to it), where would it report, and what roles would report into it? There are no easy answers as far as I can see.”
What about the CIO?
And if you are wondering whether your organization needs a CDO when you probably already have a Chief Information Officer (CIO), then “look at what we’ve asked CIOs to do,” Peter Aiken and Michael Gorman explained in their intentionally short book The Case for the Chief Data Officer. “They are responsible for infrastructure, application software packages, Ethernet connections, and everything in between. It’s an incredible range of jobs. If you look at a chief financial officer, they have a singular focus on finance, because finance and financial assets are a specific area the business cares about. Taking data as a strategic asset gives it unique capabilities, and when you take the characteristics of data and you see the breadth and scope of CIO functions, they don’t work together. It hasn’t worked, it’s not going to work, especially when you consider the other data plans coming down the pipeline.”
And there aren’t just other data plans coming down the pipeline. Our world is becoming, not just more data-driven, but increasingly data-constructed. “Global drivers have been shifting from valuing the making of things to the flow of intellectual capital,” Robert Hillard blogged. “This is the shift to an information economy which has most recently been dubbed digital disruption. There is no point, for instance, in complaining about there being less tax on music streaming than the manufacture, distribution, and sale of CDs. The value is just in a different place and most of it isn’t where it was.”
The Rise of a Second CDO?
“All businesses are now digital businesses,” Gil Press blogged. “The digitization of the entire business is spreading to all industries and all business functions and is threatening to make the central IT organization less relevant. Enter the newly-minted Chief Digital Officer expected to provide a unifying vision and develop a digital strategy, transforming existing processes and products and finding new digital-based profit and revenue opportunities. The role of the Chief Digital Officer is all about digital governance, the other CDO role—that of the Chief Data Officer—is all about data governance. With more and more digital data flowing throughout the organization, and going in and out through its increasingly porous borders, managing the quality, validity, and access to this asset is more important than ever.”
“The main similarity between the two roles,” Press explained, “is the general consensus that the new chiefs, whether of the digital or the data kind, should not report to the CIO. Theirs is a business function, while the CIO is perceived to be dealing with technology.”
“The CDO reports to the business,” Aiken and Gorman explained. “Business data architecture is a business function, not an IT function. In fact, the only data management areas that stay behind with the CIO are the actual development of databases and the tuning, backup, and recovery of the data delivery systems, with security shared between IT and the business.”
Hail to the Chiefs
“The central IT organization and CIOs may become irrelevant in the digital economy,” Press concluded. “Or, CIOs could use this opportunity to demonstrate leadership that is based on deep experience with and understanding of what data, big or small, is all about — its management, its analysis, and its use in the service of innovation, the driving force of any enterprise.”
The constantly evolving data-driven information economy is forcing enterprises to open their hailing frequencies to chiefs, both new and existing, sending a hail to the chiefs to figure out how data and information, and especially its governance, relate to their roles and responsibilities, and how they can collectively provide the corporate leadership needed in the digital age.
In 1668, the French philosopher and mathematician Edme Mariotte discovered what has come to be known as the “blind spot” in each one of our eyes: the region at the back of the retina where the optic nerve leaves the eye on its way to the visual cortex. This region has no rods or cones, so the corresponding areas of our visual field are incapable of registering light.
While this blind spot is surprisingly large (imagine the diameter of the moon in the night sky — 17 moons could fit into your blind spots), its effects are minimal because the blind spots in each of our eyes do not overlap, and so the information from one eye fills in the information lacking in the other.
As the philosopher Daniel Dennett describes our blind spot, there are no centers of the visual cortex “responsible for receiving reports from this area, so when no reports arrive, there is no one to complain. An absence of information is not the same as information about an absence.”
Daragh O Brien, in his recent article The Value of Null: The Paradox of Metrics in Data Governance, wrote about the classic information governance challenge of misunderstanding the meaning of a null value in a report. In this particular case, it was a report of issues being tracked by the data governance metrics defined by one of his clients.
The root of the problem was that only one business unit was actually reporting issues, causing executive management to misinterpret the absence of data governance metrics reported by other business units as the absence of data governance issues in those business units. This was making the business unit that was actually doing a good job with data governance look bad simply because they were the only ones actually measuring and reporting their data governance progress.
“Until you define that there is a thing you will measure as an indicator of your governance performance,” O Brien explained, “then there is nothing being measured. So the fact that my client’s peers were not publishing any metrics came down to how you interpreted the null set of metrics being produced. Ultimately, the paradox of metrics in data quality and data governance is that the simple act of measuring sets you up for attack because people have historically not had visibility of these issues and the data makes organizations ask hard questions of themselves.”
In other words, insightful metrics reveal the blind spots in an organization’s field of vision.
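The difference between an absence of information and information about an absence can be shown in a few lines of Python. The business unit names and issue counts below are invented; the only point is that a unit reporting nothing (`None`) must not be displayed the same way as a unit reporting zero issues.

```python
# Hypothetical issue counts per business unit.
# None means the unit publishes no data governance metrics at all.
reported_issues = {
    "Unit A": 12,
    "Unit B": None,
    "Unit C": None,
    "Unit D": 0,
}


def summarize(metrics):
    """Label each unit so a missing measurement is never mistaken for zero issues."""
    summary = {}
    for unit, count in metrics.items():
        if count is None:
            summary[unit] = "not measured"
        elif count == 0:
            summary[unit] = "measured, no issues"
        else:
            summary[unit] = f"measured, {count} issues"
    return summary


for unit, status in summarize(reported_issues).items():
    print(f"{unit}: {status}")
```

A report built this way would show Unit A as the only unit with known issues, but it would also show that Units B and C are blind spots rather than clean bills of health.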
Effective data quality and data governance metrics must provide insight into data that is aligned with how the business uses data to support a business process, accomplish a business objective, or make a business decision. This will prevent the organizational blindness caused when data quality and data governance are not properly measured within a business context and continually monitored.
So, when all is null on the metrics front, don’t assume that all is well behind the business lines.
In his recent Harvard Business Review blog post Are You Data Driven? Take a Hard Look in the Mirror, Tom Redman distilled twelve traits of a data-driven organization, the first of which is making decisions at the lowest possible level.
This is how one senior executive Redman spoke with described this philosophy: “My goal is to make six decisions a year. Of course that means I have to pick the six most important things to decide on and that I make sure those who report to me have the data, and the confidence, they need to make the others.”
“Pushing decision-making down,” Redman explained, “frees up senior time for the most important decisions. And, just as importantly, lower-level people spend more time and take greater care when a decision falls to them. It builds the right kinds of organizational capability and, quite frankly, appears to create a work environment that is more fun.”
I have previously blogged about how a knowledge-based organization is built upon a foundation of bottom-up business intelligence with senior executives providing top-down oversight (e.g., the strategic aspects of information governance). Following Redman’s advice, the most insightful top-down oversight is driving decision-making to the lowest possible level of a data-driven organization.
With the speed at which decisions must be made these days, organizations cannot afford to risk causing a decision-making bottleneck by making lower-level employees wait for higher-ups to make every business decision. While faster decisions aren’t always better, a shorter decision-making path is.
Furthermore, in the era of big data, speeding up your data processing enables you to integrate more data into your decision-making processes, which helps you make better data-driven decisions faster.
Well-constructed policies are flexible business rules that empower employees with an understanding of decision-making principles, trusting them to figure out how to best apply them in a particular context.
If you want to pull your organization, and its business intelligence, up to new heights, then push down business decisions to the lowest level possible. Arm your frontline employees with the data, tools, and decision-making guidelines they need to make the daily decisions that drive your organization.