Posts Tagged ‘unstructured data’
I have previously blogged about how more data might soon be generated by things than by humans, using the phrase the Internet of Humans to not only differentiate human-generated data from the machine-generated data from the Internet of Things, but also as a reminder that humans are a vital source of knowledge that no amount of data from any source could ever replace.
The Internet of Things is the source of the category of big data known as sensor data. It is often the new type of data you come across while defining big data, the type that requires you to start getting to know NoSQL, and it differentiates machine-generated data from the data generated directly by humans typing, taking pictures, recording videos, scanning bar codes, etc.
One often-alleged advantage of machine-generated data is its inherently higher data quality, since humans are removed from the equation. In this blog post, let’s briefly explore that concept.
Is it hot in here or is it just me?
Let’s use the example of accurately recording the temperature in a room, something that a thing, namely a thermostat, has been helping us do since long before the Internet of Anything ever existed.
(Please note, for the purposes of keeping this discussion simple, let’s ignore the fact that the only place where room temperature matches the thermostat is at the thermostat — as well as how that can be compensated for with a network of sensors placed throughout the room).
A thermostat only displays the current temperature, so if I want to keep a log of temperature changes over time, I could do this by writing my manual readings down in a notebook. Here is where the fallible human, in this case me, enters the equation and could cause a data quality issue.
For example, the temperature in the room is 71 degrees Fahrenheit, but I incorrectly write it down as 77 degrees (or perhaps my sloppy handwriting caused even me to later misinterpret the 1 as 7). Let’s also say that I wanted to log the room temperature every hour. This would require me to physically be present to read the thermostat every hour on the hour. Obviously, I might not be quite that punctual and I could easily forget to take a reading (or several), thereby preventing it from being recorded.
The Quality of Things
Alternatively, using an innovative new thermostat that wirelessly transmits the current temperature, at precisely defined time intervals, to an Internet-based application certainly eliminates the possibility of the two human errors in my example.
However, there are other ways a data quality issue could occur. For example, the thermostat may not be displaying the temperature accurately because it is not calibrated correctly. In addition, a power loss, mechanical failure, loss of Internet connection, or a crash of the Internet-based application could prevent the temperature readings from being recorded.
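To make the machine-side failure modes above a little more concrete, here is a minimal sketch of how a log of transmitted readings might be checked for missed intervals and implausible values. Everything in it is an assumption for illustration: the function name `find_quality_issues`, the hourly interval, and the 40 to 100 degree plausibility range are hypothetical, not part of any real thermostat's software. Note that a simple range check only catches gross errors. A thermostat that is miscalibrated by a few degrees would pass it, and catching that would likely require comparison against a trusted reference sensor.

```python
from datetime import datetime, timedelta


def find_quality_issues(readings, interval=timedelta(hours=1), low=40.0, high=100.0):
    """Check (timestamp, temp_f) tuples, sorted by time, for gaps and implausible values.

    The interval and the low/high bounds are illustrative assumptions.
    """
    gaps = []      # (previous_timestamp, next_timestamp) pairs with missed readings between them
    suspects = []  # readings that fall outside the plausible range
    for i, (ts, temp) in enumerate(readings):
        if not (low <= temp <= high):
            suspects.append((ts, temp))
        if i > 0:
            prev_ts = readings[i - 1][0]
            if ts - prev_ts > interval:
                gaps.append((prev_ts, ts))
    return gaps, suspects


# Hypothetical log: hourly readings with an outage and one implausible value.
start = datetime(2014, 1, 1, 9, 0)
log = [
    (start, 71.0),
    (start + timedelta(hours=1), 71.5),
    (start + timedelta(hours=4), 72.0),   # two hourly readings are missing before this one
    (start + timedelta(hours=5), 171.0),  # far outside the plausible range
]
gaps, suspects = find_quality_issues(log)
print("gaps:", gaps)
print("suspect readings:", suspects)
```

The check flags the three-hour silence and the 171-degree reading, but it cannot tell whether 71.5 should really have been 70.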
The point I am trying to raise is that although the quality of things entered by things certainly has some advantages over the quality of things entered by humans, nothing could be further from the truth than to claim that machine-generated data is immune to data quality issues.
Richard Ordowich, commenting on my Hail to the Chiefs post, remarked how “most organizations need to improve their data literacy. Many problems stem from inadequate data definitions, multiple interpretations and understanding about the meanings of data. Skills in semantics, taxonomy and ontology as well as information management are required. These are skills that typically reside in librarians but not CDOs. Perhaps hiring librarians would be better than hiring a CDO.”
I responded that maybe not even librarians can save us by citing The Library of Babel, a short story by Argentine author and librarian Jorge Luis Borges, which is about, as James Gleick explained in his book The Information: A History, A Theory, A Flood, “the mythical library that contains all books, in all languages, books of apology and prophecy, the gospel and the commentary upon that gospel and the commentary upon the commentary upon the gospel, the minutely detailed history of the future, the interpolations of all books in all other books, the faithful catalogue of the library and the innumerable false catalogues. This library (which others call the universe) enshrines all the information. Yet no knowledge can be discovered there, precisely because all knowledge is there, shelved side by side with all falsehood. In the mirrored galleries, on the countless shelves, can be found everything and nothing. There can be no more perfect case of information glut.”
More than a century before the rise of cloud computing and the mobile devices connected to it, the imagination of Charles Babbage foresaw another library of Babel, one where “the air itself is one vast library, on whose pages are forever written all that man has ever said or woman whispered.” In a world where word of mouth has become word of data, sometimes causing panic about who may be listening, Babbage’s vision of a permanent record of every human utterance seems eerily prescient.
Of the cloud, Gleick wrote about how “all that information—all that information capacity—looms over us, not quite visible, not quite tangible, but awfully real; amorphous, spectral; hovering nearby, yet not situated in any one place. Heaven must once have felt this way to the faithful. People talk about shifting their lives to the cloud—their informational lives, at least. You may store photographs in the cloud; Google is putting all the world’s books into the cloud; e-mail passes to and from the cloud and never really leaves the cloud. All traditional ideas of privacy, based on doors and locks, physical remoteness and invisibility, are upended in the cloud.”
“The information produced and consumed by humankind used to vanish,” Gleick concluded, “that was the norm, the default. The sights, the sounds, the songs, the spoken word just melted away. Marks on stone, parchment, and paper were the special case. It did not occur to Sophocles’ audiences that it would be sad for his plays to be lost; they enjoyed the show. Now expectations have inverted. Everything may be recorded and preserved, at least potentially: every musical performance; every crime in a shop, elevator, or city street; every volcano or tsunami on the remotest shore; every card played or piece moved in an online game; every rugby scrum and cricket match. Having a camera at hand is normal, not exceptional; something like 500 billion images were captured in 2010. YouTube was streaming more than a billion videos a day. Most of this is haphazard and unorganized.”
The Library of Babel is no longer fiction. Big Data is the Library of Babel.
In the era of big data, we’re confronted by the question Brenda Somich recently blogged about: How do you handle information overload? “Does today’s super-connected and informative online environment allow us to work to our potential?” Somich asked. “Is all this information really making us smarter?”
I have blogged about how much of the unstructured data that everyone is going gaga over is gigabytes of gossip and yottabytes of yada yada digitized. While most of our verbalized thoughts were always born this way, with word of mouth becoming word of data, big data is making little data monsters of us all.
In a way, we have become addicted to data. In her post, Somich discussed how we have become so obsessed with checking emails, news feeds, blog posts, and social media status updates that, even after hours of consuming information have gone by, we are still searching for our next data fix. Our smartphones have become our constant companions, ever-present enablers reminiscent of the nickname once given to what was then the most popular smartphone: CrackBerry.
In his book Hamlet’s BlackBerry: Building a Good Life in the Digital Age, William Powers explained that “in the sixteenth century, when information was physically piling up everywhere, it was the ability to erase some of it that afforded a sense of empowerment and control.”
“In contrast, the digital information that weighs on us today exists in a nonphysical medium, and this is part of the problem. We know it’s out there, and we have words to represent and quantify it. An exabyte, for instance, is a million million megabytes. But that doesn’t mean much to me. Where is all that data, exactly? It’s everywhere and nowhere at the same time. We’re physical creatures who perceive and know the world through our bodies, yet we now spend much of our time in a universe of disembodied information. It doesn’t live here with us, we just peer at it through a two-dimensional screen. At a very deep level of the consciousness, this is arduous and draining.”
Without question, big data is forcing us to revisit information overload. But sometimes it’s useful to remember that the phrase is over forty years old now — and it originally expressed the concern, not about the increasing amount of information, but about our increasing access to information.
Just because we now have unprecedented access to an unimpeded expansion of information doesn’t mean we need to access it right now. Just because disembodied information is everywhere doesn’t mean that our bodies need to consume it.
One thing we must do, therefore, to avoid such snafus as the haunting hyper-connected hyperbole of the infinite inbox, is acknowledge the infinitesimal value of most of the information we consume.
When you are feeling overwhelmed by the amount of information you have access to, stop for a moment and consider how underwhelming most of it is. I think part of the reason we keep looking for more information is because we’re so unsatisfied with the information we’ve found.
Although information overload is a real concern and does occur frequently, far more often I think it is information underwhelm that is dragging us down.
How much of the content of those emails, news feeds, blog posts, and social media status updates you read yesterday, or even earlier today, do you actually remember? If you’re like me, probably not much, which is why we need to mind the gap between our acquisition and application of information.
As Anton Chekhov once said, “knowledge is of no value unless you put it into practice.” By extension, consuming information is of no value unless you put it to use. And an overwhelming amount of the information now available to us is so underwhelming that it’s useless to consume.
In his 1938 collection of essays World Brain, H. G. Wells explained that “it is not the amount of knowledge that makes a brain. It is not even the distribution of knowledge. It is the interconnectedness.”
This brought to my brain the traditional notion of data warehousing as the increasing accumulation of data, distributing information across the organization, and providing the knowledge necessary for business intelligence.
But is an enterprise data warehouse the Enterprise Brain? Wells suggested that interconnectedness is what makes a brain. Despite Ralph Kimball’s definition of a data warehouse as the union of its data marts, more often than not a data warehouse is a confederacy of data silos whose only real interconnectedness is being co-located on the same database server.
Looking at how our human brains work in his book Where Good Ideas Come From, Steven Johnson explained that “neurons share information by passing chemicals across the synaptic gap that connects them, but they also communicate via a more indirect channel: they synchronize their firing rates, what neuroscientists call phase-locking. There is a kind of beautiful synchrony to phase-locking—millions of neurons pulsing in perfect rhythm.”
The phase-locking of neurons pulsing in perfect rhythm is an apt metaphor for the business intelligence provided by the structured data in a well-implemented enterprise data warehouse.
“But the brain,” Johnson continued, “also seems to require the opposite: regular periods of electrical chaos, where neurons are completely out of sync with each other. If you follow the various frequencies of brain-wave activity with an EEG, the effect is not unlike turning the dial on an AM radio: periods of structured, rhythmic patterns, interrupted by static and noise. The brain’s systems are tuned for noise, but only in controlled bursts.”
Scanning the radio dial for signals amidst the noise is an apt metaphor for the chaos of unstructured data in external sources (e.g., social media). Should we bring order to chaos by adding structure (or at least better metadata) to unstructured data? Or should we just reject the chaos of unstructured data?
Johnson recounted research performed in 2007 by Robert Thatcher, a brain scientist at the University of South Florida. Thatcher studied the vacillation between the phase-lock (i.e., orderly) and chaos modes in the brains of dozens of children. On average, the chaos mode lasted for 55 milliseconds, but for some children it approached 60 milliseconds. Thatcher then compared the brain-wave scans with the children’s IQ scores, and found that every extra millisecond spent in the chaos mode added as much as 20 IQ points, whereas longer spells in the orderly mode deducted IQ points, but not as dramatically.
“Thatcher’s study,” Johnson concluded, “suggests a counterintuitive notion: the more disorganized your brain is, the smarter you are. It’s counterintuitive in part because we tend to attribute the growing intelligence of the technology world with increasingly precise electromechanical choreography. Thatcher and other researchers believe that the electric noise of the chaos mode allows the brain to experiment with new links between neurons that would otherwise fail to connect in more orderly settings. The phase-lock [orderly] mode is where the brain executes an established plan or habit. The chaos mode is where the brain assimilates new information.”
Perhaps the Enterprise Brain also requires both orderly and chaos modes, structured and unstructured data, and the interconnectedness between them, forming a digital neural network with orderly structured data firing in tandem, while the chaotic unstructured data assimilates new information.
Perhaps true business intelligence is more disorganized than we have traditionally imagined, and perhaps adding a little disorganization to your Enterprise Brain could make your organization smarter.
“If you analyzed the flow of digital data in 1980,” Stephen Baker wrote in his 2011 book Final Jeopardy: Man vs. Machine and the Quest to Know Everything, “only a smidgen of the world’s information had found its way into computers.”
“Back then, the big mainframes and the new microcomputers housed business records, tax returns, real estate transactions, and mountains of scientific data. But much of the world’s information existed in the form of words—conversations at the coffee shop, phone calls, books, messages scrawled on Post-its, term papers, the play-by-play of the Super Bowl, the seven o’clock news. Far more than numbers, words spelled out what humans were thinking, what they knew, what they wanted, whom they loved. And most of those words, and the data they contained, vanished quickly. They faded in fallible human memories, they piled up in dumpsters and moldered in damp basements. Most of these words never reached computers, much less networks.”
However, during the era of big data, things have significantly changed. “In the last decade,” Baker continued, “as billions of people have migrated their work, mail, reading, phone calls, and webs of friendships to digital networks, a giant new species of data has arisen: unstructured data.”
“It’s the growing heap of sounds and images that we produce, along with trillions of words. Chaotic by nature, it doesn’t fit neatly into an Excel spreadsheet. Yet it describes the minute-by-minute goings-on of much of the planet. This gold mine is doubling in size every year. Of all the data stored in the world’s computers and coursing through its networks, the vast majority is unstructured.”
One of Melinda Thielbar’s three questions of data science is: “Are these results actionable?” As Baker explained, unstructured data describes the minute-by-minute goings-on of much of the planet, so the results of analyzing unstructured data must be actionable, right?
Although sentiment analysis of unstructured social media data is often lauded as a great example of actionable results, late last year Augie Ray wrote a blog post asking How Powerful Is Social Media Sentiment Really?
My contrarian’s view of unstructured data is that it is, in large part, gigabytes of gossip and yottabytes of yada yada digitized, rumors and hearsay amplified by the illusion-of-truth effect and succumbing to the perception-is-reality effect until the noise amplifies so much that its static solidifies into a signal.
As Roberta Wohlstetter originally defined the terms, signal is the indication of an underlying truth behind a statistical or predictive problem, and noise is the sound produced by competing signals.
The signals from unstructured data compete with other signals in a digital world of seemingly infinite channels broadcasting a cacophony that makes one nostalgic for a luddite’s dream of a world before word of mouth became word of data, and before private thoughts contained within the neural networks of our minds became public thoughts shared within social networks, such as Twitter, Facebook, and LinkedIn.
“While it may seem heretical to say,” Ray explained, “I believe there is ample evidence social media sentiment does not matter equally in every industry to every company in every situation. Social media sentiment has been elevated to God-like status when really it is more of a minor deity. In most situations, what others are saying does not trump our own personal experiences. In addition, while public sentiment may be a factor in our purchase decisions, we weigh it against many other important factors such as price, convenience, perception of quality, etc.”
Social media is not the only source of unstructured data, nor am I suggesting there’s no business value in this category of big data. However, sometimes a contrarian’s view is necessary to temper unchecked enthusiasm: not only is a lot of big data unstructured, but enthusiasm for it is often unchecked.
An interesting article in BusinessWeek on Big Data recently caught my eye. The article mentions different applications that allow organizations to make sense of the vast–and exponentially increasing–amount of unstructured data out there. From the piece:
“When the amount of data in the world increases at an exponential rate, analyzing that data and producing intelligence from it becomes very important,” says Anand Rajaraman, senior vice-president of global e-commerce at Wal-Mart and head of @WalmartLabs, the retailer’s division charged with improving its use of the Web.
Today more than ever, intelligent businesses are trying to make sense of millions of tweets, blog posts, comments, reviews, and other forms of unstructured data. The obvious question becomes, “How?”
I’ve written before on this site about collaborative filtering and semantic technologies. For many reasons beyond the scope of this post, companies such as Apple, Google, Facebook, and Amazon benefit from crowdsourcing and the law of large numbers much more than traditional companies. For instance, Apple knows the following about its customers who buy apps through its App Store:
- which customers like which apps
- which customers buy (and like) other apps based on the purchase of the first app
- which customers are more likely to consider buying apps in the same category
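As a rough illustration of how the kind of knowledge listed above can be derived from purchase records, here is a minimal sketch of item co-occurrence counting, one of the simplest forms of collaborative filtering. This is not Apple's method, which I have no knowledge of. The customer names, app names, and the functions `co_purchase_counts` and `recommend` are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def co_purchase_counts(purchases):
    """Map each app to a Counter of the other apps bought by the same customers."""
    related = defaultdict(Counter)
    for apps in purchases.values():
        for a, b in permutations(set(apps), 2):
            related[a][b] += 1
    return related


def recommend(purchases, customer, top_n=3):
    """Suggest apps the customer lacks, ranked by co-purchase count, then by name."""
    owned = set(purchases[customer])
    related = co_purchase_counts(purchases)
    scores = Counter()
    for app in owned:
        for other, count in related[app].items():
            if other not in owned:
                scores[other] += count
    # Sort by descending score; break ties alphabetically so the result is deterministic.
    ranked = sorted(scores, key=lambda app: (-scores[app], app))
    return ranked[:top_n]


# Hypothetical purchase history: customer -> apps bought.
purchases = {
    "ann": ["chess", "sudoku", "weather"],
    "bob": ["chess", "sudoku"],
    "cat": ["chess", "podcasts"],
    "dan": ["chess"],
}
print(recommend(purchases, "dan"))  # ['sudoku', 'podcasts', 'weather']
```

With only four customers the ranking means very little, which is rather the point: this approach depends on the law of large numbers, and it only becomes useful with purchase data on a great many customers.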
Few businesses have the same level of knowledge about their customers. Apple is the exception that proves the rule. In other words, rare is the organization with access to detailed data on millions of its customers, structured or otherwise. What’s a “normal” company to do? Is there nothing they can do but watch from the sidelines?
In a word, no. These emerging applications show tremendous promise.
Now, I won’t pretend to have intimate knowledge of each of these data-mining applications and projects. At a high level, they are designed to help large organizations interpret vast amounts of data. Clearly, developers out there have recognized the need for such applications and have built them according to what they think the market wants.
One application equipped to potentially make sense of all of this unstructured data is Hadoop, an open source development project. From its website, “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.” Certainly worth looking into.
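The programming model Hadoop is best known for is MapReduce. To give a feel for it, here is a minimal sketch in plain Python that counts words across a handful of made-up tweets. It does not use Hadoop or any of its APIs, and it runs on a single machine. The function names and the sample tweets are assumptions for illustration only. In a real cluster, the framework would distribute the map and reduce steps across many computers and handle the grouping in between.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word.strip(".,!?\"'"), 1


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        if key:  # skip tokens that were pure punctuation
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical unstructured input.
tweets = ["Love the new phone!", "The phone is okay.", "love it"]
counts = word_count(tweets)
print(counts)
```

Counting words is only the customary first example. The appeal of the model is that the same map and reduce functions would, in principle, work unchanged whether the input is three tweets or three billion.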
The Benefits of Free
Imagine for a moment that you’re a mid-level manager in a large organization. You see the need for a tool that would help you mine data and attempt to find hidden patterns–and ultimately knowledge. Excel just doesn’t cut it. You’d love to see what Hadoop or one of its equivalents can do. Yet, you’re not about to fall on your sword for an unproven product in tight economic times.
Free and open source tools are certainly worth considering. Download Hadoop or another application and begin playing with it. Network with people online and off. Ask them questions. Noodle with different data sets and see if you learn anything about your company, its customers, and underlying trends and drivers. Worst case scenario: you waste a little time.
In any organization, the traditional RFP process can be extremely cumbersome. Times are tight, and it’s entirely possible that even potentially valuable projects like mining unstructured data may not get the go-ahead. What’s more, your organization may have established relationships with companies like IBM that offer proprietary applications and services in the BI space. And, to be sure, Hadoop and other open source/free tools may not meet all of your organization’s needs.
All of this is to say that open source software is no panacea in any case–and here is no exception. However, doesn’t it behoove you to see what’s out there before making the case for a large expenditure–one that may ultimately not succeed? Is there any real harm in downloading a piece of software just to see what it can do?
What say you?