A Contrarian’s View of Unstructured Data

“If you analyzed the flow of digital data in 1980,” Stephen Baker wrote in his 2011 book Final Jeopardy: Man vs. Machine and the Quest to Know Everything, “only a smidgen of the world’s information had found its way into computers.”

“Back then, the big mainframes and the new microcomputers housed business records, tax returns, real estate transactions, and mountains of scientific data.  But much of the world’s information existed in the form of words—conversations at the coffee shop, phone calls, books, messages scrawled on Post-its, term papers, the play-by-play of the Super Bowl, the seven o’clock news.  Far more than numbers, words spelled out what humans were thinking, what they knew, what they wanted, whom they loved.  And most of those words, and the data they contained, vanished quickly.  They faded in fallible human memories, they piled up in dumpsters and moldered in damp basements.  Most of these words never reached computers, much less networks.”

However, during the era of big data, things have significantly changed.  “In the last decade,” Baker continued, “as billions of people have migrated their work, mail, reading, phone calls, and webs of friendships to digital networks, a giant new species of data has arisen: unstructured data.”

“It’s the growing heap of sounds and images that we produce, along with trillions of words.  Chaotic by nature, it doesn’t fit neatly into an Excel spreadsheet.  Yet it describes the minute-by-minute goings-on of much of the planet.  This gold mine is doubling in size every year.  Of all the data stored in the world’s computers and coursing through its networks, the vast majority is unstructured.”

One of Melinda Thielbar’s three questions of data science is: “Are these results actionable?”  As Baker explained, unstructured data describes the minute-by-minute goings-on of much of the planet, so the results of analyzing unstructured data must be actionable, right?

Sentiment analysis of unstructured social media data is often lauded as a great example of actionable results.  Late last year, however, Augie Ray wrote an excellent blog post asking “How Powerful Is Social Media Sentiment Really?”

My contrarian’s view of unstructured data is that it is, in large part, gigabytes of gossip and yottabytes of yada yada digitized, rumors and hearsay amplified by the illusion-of-truth effect and succumbing to the perception-is-reality effect until the noise amplifies so much that its static solidifies into a signal.

As Roberta Wohlstetter originally defined the terms, signal is the indication of an underlying truth behind a statistical or predictive problem, and noise is the sound produced by competing signals.

The signals from unstructured data compete with other signals in a digital world of seemingly infinite channels broadcasting a cacophony.  It makes one nostalgic for a Luddite’s dream of a world before word of mouth became word of data, and before private thoughts contained within the neural networks of our minds became public thoughts shared within social networks such as Twitter, Facebook, and LinkedIn.

“While it may seem heretical to say,” Ray explained, “I believe there is ample evidence social media sentiment does not matter equally in every industry to every company in every situation.  Social media sentiment has been elevated to God-like status when really it is more of a minor deity.  In most situations, what others are saying does not trump our own personal experiences.  In addition, while public sentiment may be a factor in our purchase decisions, we weigh it against many other important factors such as price, convenience, perception of quality, etc.”

Social media is not the only source of unstructured data, nor am I suggesting there’s no business value in this category of big data.  However, sometimes a contrarian’s view is necessary to temper unchecked enthusiasm: not only is a lot of big data unstructured, but enthusiasm for it is often unchecked.
