Everybody has a plan until they get punched in the face.
Who should do what on a Big Data project?
It seems like a logical and even necessary question, right? After all, Big Data is a big deal, and requires assistance from each line of business, the top brass, and IT, right?
Matt Ariker, Tim McGuire, and Jesko Perry recently wrote an HBR post attempting to answer this question. In Five Roles You Need on Your Big Data Team, the three advocate five “important roles to staff your advanced analytics bureau”:
- Data Hygienists
- Data Explorers
- Business Solution Architects
- Data Scientists
- Campaign Experts
To be sure, not everyone can or should do everything in an era of Big Data. I can’t tell you for certain that bifurcating roles like the authors recommend won’t work. Still, I just don’t buy the argument that Big Data lends itself to everything fitting neatly into traditional roles.
Take data quality, for instance. As Jim Harris writes:
The quality of the data in the warehouse determines whether it’s considered a trusted source, but it faces a paradox similar to “which came first, the chicken or the egg?” Except for the data warehouse it’s “which comes first, delivery or quality?” However, since users can’t complain about the quality of data that hasn’t been delivered yet, delivery always comes first in data warehousing.
Agreed. Traditional data warehousing projects could be thought of in a more linear fashion. In most cases, organizations were attempting to aggregate–and report on–their data (read: data internal to the enterprise). Once that source was added, maintenance was fairly routine, at least compared to today’s datasets. These projects tended to be more predictable.
But what happens when much if not most relevant data stems from outside of the enterprise? What do we do when new data sources start popping up faster than ever? Mike Tyson’s quote at the top of this post has never been more apropos.
Simon Says: Big Data Is Not Predictable
My point is that IT projects have start and end dates, but Big Data does not. Amazon, Apple, Facebook, Twitter, Google, and other successful companies don’t view Big Data as an “IT project,” and treating it as one is a potentially lethal mistake. For its part, Netflix views both Big Data and data visualization as ongoing processes; they are never finished. I make the same point in my last book.
When you start thinking of Big Data as an initiative or project with traditionally defined roles, you’re on the road to failure. Don’t make “data hygiene” or “data exploring” the sole purview of a group, department, or individual. Encourage others to step out of their comfort zones, notice things, test hypotheses, and act upon them.
What say you?
Many organizations are wrapping their enterprise brain around the challenges of business intelligence, looking for the best ways to analyze, present, and deliver information to business users. More organizations are choosing to do so by pushing business decisions down in order to build a bottom-up foundation.
However, one question coming up more frequently in the era of big data is what should be the division of labor between computers and humans?
In his book Emergence: The Connected Lives of Ants, Brains, Cities, and Software, Steven Johnson discussed how the neurons in our human brains are only capable of two hundred calculations per second, whereas the processors in computers can perform millions of calculations per second.
This is why we should let the computers do the heavy lifting for anything that requires math skills, especially the statistical heavy lifting required by big data analytics. “But unlike most computers,” Johnson explained, “the brain is a massively parallel system, with 100 billion neurons all working away at the same time. That parallelism allows the brain to perform amazing feats of pattern recognition, feats that continue to confound computers—such as remembering faces or creating metaphors.”
As the futurist Ray Kurzweil has written, “humans are far more skilled at recognizing patterns than in thinking through logical combinations, so we rely on this aptitude for almost all of our mental processes. Indeed, pattern recognition comprises the bulk of our neural circuitry. These faculties make up for the extremely slow speed of human neurons.”
“Genuinely cognizant machines,” Johnson explained, “are still on the distant technological horizon, and there’s plenty of reason to suspect they may never arrive. But the problem with the debate over machine learning and intelligence is that it has too readily been divided between the mindless software of today and the sentient code of the near future.”
But even if increasingly more intelligent machines “never become self-aware in any way that resembles human self-awareness, that doesn’t mean they aren’t capable of learning. An adaptive information network capable of complex pattern recognition could prove to be one of the most important inventions in all of human history. Who cares if it never actually learns how to think for itself?”
Business intelligence in the era of big data and beyond will best be served if we let both the computers and the humans play to their strengths. Let’s let the computers calculate and the humans cogitate.
On October 28, 2012, the Oklahoma City Thunder traded star sixth-man James Harden to the Houston Rockets. The move was not entirely unexpected, as the team had been unable to work out a long-term extension with Harden. Fans were disappointed, as this trade broke up the young core of the Western Conference champions. (Harden was looking for a max contract and the Thunder already had two max players signed long-term.*)
While the move itself wasn’t entirely unexpected, the data behind it was the real surprise.
Rockets’ GM Daryl Morey comes from the Moneyball school of sports management. That is, all else equal, it’s better to make decisions based upon data than gut instinct. To this end, Morey had long coveted Harden, an incredibly efficient player.
As the following chart from HotShotCharts demonstrates, Harden naturally navigates to places on the floor that lend themselves to high expected values.
You can noodle for days on the HSC site, looking at visual data from different teams, players, and arenas. For his part, Harden generally takes shorter three-pointers and layups. (See the red dots above.) He avoids long two-pointers because they have lower expected values. Note the low shot counts inside the arc but outside of the paint.
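To make the expected-value logic concrete, here is a minimal sketch with invented, merely plausible shooting percentages (not Harden’s actual numbers):

```python
# Illustrative expected-value arithmetic behind Moneyball-style shot selection.
# The field-goal percentages below are invented round numbers for illustration.
shots = {
    "layup":            {"points": 2, "fg_pct": 0.60},
    "long two-pointer": {"points": 2, "fg_pct": 0.40},
    "three-pointer":    {"points": 3, "fg_pct": 0.36},
}

for name, shot in shots.items():
    ev = shot["points"] * shot["fg_pct"]  # expected points per attempt
    print(f"{name:>16}: {ev:.2f} expected points per attempt")

# Even at a lower shooting percentage, the three (1.08) beats the long
# two (0.80), which is why efficient players avoid the long two-pointer.
```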
What’s more, field goal percentage (FG%) is a better gauge of player effectiveness than raw point totals. Players like Kobe Bryant, Allen Iverson, and Carmelo Anthony score a bunch of points, but they typically take far too many shots. (Even I would score ten points per game if you gave me enough shots, and I’m not very good at hoops.)
Data is permeating every facet of business and, I’d argue, life. While dataviz tools are no substitute for common sense, they are crystallizing differences among companies, products, and even NBA players.
Relying exclusively on old standbys like Microsoft Excel leaves money on the table. Why not look at different ways to view your data? You may well be surprised at what you find.
What say you?
* The Thunder offered Harden $55.5 million over four years–$4.5 million less than the max deal Harden coveted and will get from the Rockets, sources told ESPN The Magazine.
“Who, what, and where, by what helpe, and by whose:
Why, how, and when, doe many things disclose.”
- Thomas Wilson, The Arte of Rhetorique, 1560
Our human proclivity for story-telling as the primary method we have to communicate our culture — its values, tales, foibles and limitless possibilities — seems to have been ignored, if not forgotten, by the standards we have for digital communications. Rather than build systems that promote human understanding of and appreciation for the complexity of human endeavor, we have instead built a somewhat technocratic structure to transmit ‘facts’ about ‘things’. A consequence of this mono-maniacal focus on ‘facts’ is that we risk the loss of humanity in the thicket of numbers and pseudo-semantic string values we pass around among ourselves.
Let’s instead look at a way to structure our communications that relies on time-tested concepts of ‘best practice’ story-telling. It may seem odd to talk of ‘best practices’ and story-telling in the same breath, though indeed we’re all familiar with these ideas.
The “5 Ws” are a good place to start. These are five (or six, or sometimes more) basic questions whose answers are essential to effective story-telling, often taught to students of journalism, scientific research, and police forensics. Stories that neglect to relay the entirety of the “Who-What-When-Where-Why” (the five Ws) of a topic may too easily leave listeners ‘semi-ignorant’ about it. In a court of law, for instance, trials in which method (how), motive (why), and opportunity (when) are not fully explicated and validated rightly end with verdicts of ‘not guilty’.
The same approach — providing the ‘full story’ about a topic — should undergird our methods of digital communications as well. Perhaps by doing so, much of the current debate about “context” for/of any particular semantic (RDF) dataset might be more tractable and resolvable.
The practical significance of a “5 Ws” approach is that it can directly provide a useful metric for RDF datasets. A low score suggests the dataset makes little effort to give the ‘full story’ about its set of topics, while a high score indicates good conformance to the precepts of best practice communications. In the real world, for instance, a threshold for this metric could be specified in contracts that envision information being exchanged between the parties.
While a high score of course wouldn’t attest to the reliability or consistency of each answer to the “5 Ws” for a given topical datastream, a low score indicates that the ‘speaker’ is merely spouting facts (that is, the RDF approach, which is to “say anything about anything”), best used to accentuate one’s own story but not useful as a complete recounting in its own right.
A “best practice communications” metric might be formulated by examining the nature of the property values associated with a resource. If an entity’s properties are each subclassed from one of the 5W classes, then scoring can be a matter of “provisionally checking the box” to the extent that some answer exists: a 100% score might indicate the information about a given topic is nominally complete, while a 0% score indicates that merely a reference to the existence of the resource has been provided.
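Such scoring is easy to prototype. Here is a minimal sketch in Python using rdflib; the example.org namespace and the mapping of predicates to each of the 5 Ws are hypothetical stand-ins for what a real ontology would define by subclassing:

```python
from rdflib import Graph, Namespace, URIRef

# Hypothetical ontology: which predicates count as answering each W.
EX = Namespace("http://example.org/ns#")
FIVE_W_MAP = {
    "who":   {EX.author, EX.agent},
    "what":  {EX.title, EX.description},
    "when":  {EX.date},
    "where": {EX.location},
    "why":   {EX.purpose},
}

def five_w_score(graph: Graph, resource: URIRef) -> float:
    """Fraction of the 5 Ws for which at least one answer exists."""
    predicates = {p for p, _ in graph.predicate_objects(resource)}
    answered = sum(1 for preds in FIVE_W_MAP.values() if preds & predicates)
    return answered / len(FIVE_W_MAP)

g = Graph()
g.add((EX.report1, EX.author, EX.alice))  # who
g.add((EX.report1, EX.date, EX.q3_2013))  # when
print(five_w_score(g, EX.report1))        # 0.4 -- only 2 of the 5 Ws answered
```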
Viewing semantic datastreams each as a highly formalized story (or set of stories), then applying quality criteria developed by professional communicators as long ago as Cicero, can provide valuable insights when building high quality data models and transmitting high quality datastreams.
Like many, I’m one who’s been around since the cinder block days, once entranced by shiny Tektronix tubes stationed near a dusty card sorter. After years using languages as varied as Assembler and Scheme, I’ve come to believe the shift these represented, from procedural to declarative, has greatly improved the flexibility of the software organizations produce.
Interest has now moved toward an equally flexible representation of data. In the ‘old’ days, when an organization wanted to collect a new data-item about, say, a Person, a new column would first have to be added by a friendly database administrator to a Person Table in one’s relational database. Very inflexible.
The alternative — now widely adopted — reduces databases to a simple formulation, one that eliminates Person and other entity-specific tables altogether. These “triple-stores” basically have just three columns — Subject, Predicate and Object — in which all data is stored. Triple-stores are often called ‘self-referential’ because, first, the type of the Subject of any row in a triple-store is found in a different row (not column) of the triple-store and, second, the definitions of types themselves are found in still other rows. The benefits? Not only is the underlying structure of a triple-store unchanging, but stand-alone metadata tables (tables describing tables) become unnecessary.
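To make the idea concrete, here is a minimal sketch of those three columns in plain Python (a real triple-store uses URIs and indexes; all of the names below are illustrative):

```python
# Every fact -- data and schema alike -- is a (Subject, Predicate, Object) row.
triples = [
    # Data about a person...
    ("person:42", "rdf:type", "ex:Person"),
    ("person:42", "ex:name",  "Ada Lovelace"),
    # ...and the schema lives in ordinary rows too, not in a separate
    # metadata table: this is what makes the store 'self-referential'.
    ("ex:Person", "rdf:type", "rdfs:Class"),
    ("ex:name",   "rdf:type", "rdf:Property"),
]

# Collecting a brand-new data-item about a Person means appending a row,
# not asking a database administrator to alter a table definition.
triples.append(("person:42", "ex:twitterHandle", "@ada"))

# A toy query: everything we know about person:42.
for s, p, o in triples:
    if s == "person:42":
        print(p, "->", o)
```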
Why bother, though? Static relational database tables do work well enough to handle transactional records whose data items are usually well-known in advance; the rate of change in those business processes is fairly low, so the cost of database architectures based on SQL tables is equally low. What, then, is driving the adoption of triple-stores?
The scope of business functions organizations seek to automate has enlarged considerably: the source of new information within an organization is less frequently “forms” completed by users and more frequently raw text from documents, tweets, blogs, emails, newsfeeds, and other ‘social’ web and internal sources produced, received, or retrieved by organizations.
Semantic technologies are essential components of Natural Language Processing (NLP) applications that extract and convert, for instance, all proper nouns within a text into harvestable networks of “information nodes” in a triple-store. In fact, during such harvesting, context becomes a crucial variable that can change with each sentence analyzed from the text.
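As a rough sketch of that harvesting step, here is what extracting proper nouns into triples might look like using spaCy’s named-entity recognizer; the predicate and document names are invented for illustration:

```python
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Tim Berners-Lee proposed the World Wide Web at CERN in 1989."

triples = []
for ent in nlp(text).ents:
    # Each named entity becomes an information node typed by its NER label;
    # a production pipeline would also track context sentence by sentence.
    triples.append((ent.text, "rdf:type", ent.label_))
    triples.append((ent.text, "ex:mentionedIn", "doc:1"))

for triple in triples:
    print(triple)
```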
Which brings us to my primary distinction between really semantic and non-semantic applications: really semantic applications mimic a human conversation, where the knowledge of an individual in a conversation is the result of a continuous accrual of context-specific facts, context-specific definitions, even context-specific contexts. As a direct analogy, Wittgenstein, a modern giant of philosophy, called this phenomenon Language Games, to connote that one’s techniques and strategies for analyzing a game’s state, and one’s actions, are not derivable in advance — they come only during the play of the game, i.e., during processing of the text corpora.
Non-semantic applications on the other hand, are more similar to rites, where all operative dialogs are pre-written, memorized, and repeated endlessly.
This analogy to human conversations (to ‘dynamic semantics’) is hardly trivial; it is a dominant modelling technique among ontologists, as evidenced by the development of, for instance, Discourse Representation Theory (legal communities have a similar theory, simply called Argumentation), whose rules are used to build Discourse Representation Structures from a stream of sentences, accommodating a variety of linguistic issues including plurals, tense, aspect, generalized quantifiers, and anaphora.
“Semantic models” are an important path towards a more complete understanding of how humans, when armed with language, are able to reason and draw conclusions about the world. Relational tables, however, in themselves haven’t provided similar insight or re-purposing in different contexts. This fact alone is strong evidence that semantic methods and tools must be prominent in any organization’s technology plans.
I recently hosted a well-attended webinar on Big Data in the public sector. It went reasonably well and, at the end, I answered some questions from inquisitive listeners.
Now, most were fairly standard queries. I sensed that there was a little skepticism about the power of Big Data among attendees. To be sure, there’s no shortage of hype these days. I also received the completely expected question, “How can you determine the value of Big Data? How do I calculate its ROI?”
I’ve ranted on this topic before, and I just don’t agree with those who won’t move before they have precisely quantified the ROI of Big Data. At best, these are SWAGs. At worst, they are biased calculations driven by vested consulting firms and vendors trying to hawk their wares.
Replicating the Old in the New
With any new technology or application, employees and enterprises often seem to fall into the same traps. We are creatures of habit, after all. Over my career, I have repeatedly seen organizations deploy new technologies and make similar mistakes. Among the worst: employees simply replicated what they were doing before in the old system or reporting application. It became old hat: I would often just help people re-create their previous standard reports in the new system.
Many people just didn’t care about the new system’s enhanced functionality. Put simply, these people just didn’t want to learn.
Again, I understand this mind-set, but I don’t agree with it. When you’re doing what you just did before, you squander massive and unprecedented opportunities. You fail to explore and discover new insights. In these types of organizations, I’d wager that pre-implementation ROI calculations exceed real-world results.
On occasion, however, I have worked with people curious about the enhanced functionality of the new system or reporting tool. They didn’t just want to set up old standard reports. They wanted to learn and explore new things. What could they do now that they couldn’t do before? In cases like these, I would “take the over” on any ROI estimate.
First, don’t think of Big Data as “another application.” It’s not.
Second, realize that ROI calculations are often imprecise at best. What’s more, as books like The Halo Effect: … and the Eight Other Business Delusions That Deceive Managers manifest, the world doesn’t stand still. How can any ROI model account for what may happen when many of the unknowns are unknown?
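To see just how malleable such calculations are, consider a minimal sketch of the basic ROI formula with entirely invented figures; a modest shift in the assumed benefit swings the headline number from negative to spectacular:

```python
# ROI = (benefit - cost) / cost, with an assumed (invented) project cost.
cost = 1_000_000

for assumed_benefit in (900_000, 1_200_000, 2_000_000):
    roi = (assumed_benefit - cost) / cost
    print(f"assumed benefit ${assumed_benefit:,}: ROI = {roi:+.0%}")

# Prints -10%, +20%, and +100% -- all from assumptions nobody can
# verify in advance, which is why these estimates are SWAGs at best.
```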
Whether you get Big Data or you don’t, you can twist ROI calculations to prove your point.
What say you?
Content Contributor Contest
Hey members! Ready to showcase your IM skills? Contribute a new wiki article or expand an existing article between now and September 30, 2013 and you’ll be automatically entered into a drawing for a free iPad mini!
Contest Terms and Conditions:
Eligibility is open to any new or existing MIKE2.0 member who makes a significant wiki article expansion (expansion must be greater than half the existing wiki article) or creates a new article. Please note that new articles MUST be related to our core IM competencies such as Business Intelligence, Agile Development, Big Data, Information Governance, etc. Off-topic, promotional and spam articles will be automatically disqualified.
Drawing to take place on October 1, 2013. Winning community member will be notified at the email address listed on their user profile. Contest not open to existing MGA board members, MIKE2.0 team, or contractors.
Need somewhere to start?
How about the [most wanted pages]; or the pages we know [need more work]; or even the [stubs] that somebody else has started, but hasn’t been able to finish. Still not sure? Just contact us at firstname.lastname@example.org and we’ll be happy to refer you to some articles that might be a good fit for your expertise.
In his recent blog post What’s the Matter with ‘Meta’?, John Owens lamented the misuse of the term metadata — the metadata about metadata — when discussing matters within the information management industry, such as metadata’s colorful connection with data quality. As Owens explained, “the simplest definition for ‘Meta Data’ is that it is ‘data about data’. To be more precise, metadata describes the structure and format (but not the content) of the data entities of an enterprise.” Owens provided several examples in his blog post, which also received great commentary.
Some commenters resisted oversimplifying metadata as data about data, including Rob Karel who, in his recent blog post Metadata, So Mom Can Understand, explained that “at its most basic level, metadata is something that helps to better describe the data you’re trying to remember.”
This metadata crisis reminded me of the book Moonwalking with Einstein: The Art and Science of Remembering Everything, where author Joshua Foer described a strange kind of forgetfulness that psychologists have dubbed the Baker-baker Paradox.
As Foer explained it: “A researcher shows two people the same photograph of a face and tells one of them the guy is a baker and the other that his last name is Baker. A couple days later, the researcher shows the same two guys the same photograph and asks for the accompanying word. The person who was told the man’s profession is much more likely to remember it than the person who was given his surname. Why should that be? Same photograph. Same word. Different amount of remembering.”
“When you hear that the man in the photo is a baker,” Foer explained, “that fact gets embedded in a whole network of ideas about what it means to be a baker: He cooks bread, he wears a big white hat, he smells good when he comes home from work.”
“The name Baker, on the other hand,” Foer continued, “is tethered only to a memory of the person’s face. That link is tenuous, and should it dissolve, the name will float off irretrievably into the netherworld of lost memories. (When a word feels like it’s stuck on the tip of the tongue, it’s likely because we’re accessing only part of the neural network that contains the idea, but not all of it.)”
“But when it comes to the man’s profession,” Foer concluded, “there are multiple strings to reel the memory back in. Even if you don’t at first remember that the man is a baker, perhaps you get some vague sense of breadiness about him, or see some association between his face and a big white hat, or maybe you conjure up a memory of your own neighborhood bakery. There are any number of knots in that tangle of associations that can be traced back to his profession.”
Metadata makes data better, helping us untangle the knots of associations among the data and information we use every day. Whether we be bakers or Bakers, or professions or people described by other metadata, the better we can describe ourselves and our business, the better our business will be.
Although we may not always agree on the definitions demarcating metadata, data, and information, let’s not forget that what matters most is enabling better business the best we can.
Right now, researchers are working around the world to find ways of restoring sight to the blind by creating a bionic eye. The closest analogy is the bionic ear, more properly called a cochlear implant, which works by directly stimulating the auditory nervous system.
While the direct targeting of retinal tissue is analogous to the bionic ear, the implications for our grandchildren and great grandchildren go well beyond the blind community who are the intended recipients.
In the decades since the first cochlear implant, the sole objective of research has been to improve the quality of the sound heard. There is no hint of a benefit in adding additional information into the stream.
But the visual cortex isn’t the auditory nervous system; it is very close to being a direct feed into the brain.
The bionic eye could be very different to the bionic ear. In the first instance, it is obvious that a direct feed of location data from the web could be added to the stream to help interpret the world around the user. It isn’t much of a leap to go beyond navigation and make the bionic eye a full interface to the Internet.
By the time the bionic eye is as mature as the bionic ear is today, the information revolution should be complete and we will well and truly be living in an information economy. Arguably the most successful citizens will navigate information overload with ease. But how will they do it?
Today’s multi-dimensional tools for navigating complex information just won’t cut it. Each additional dimension that the human mind can understand reduces the number of “small world” steps needed to solve information problems. You can read more about the Small Worlds measure in chapter 5 of Information-Driven Business or find a summary of the technique in MIKE2.0.
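For a rough feel of that idea (a toy illustration, not the measure from the book), here is a sketch using networkx: as each information node gains more links, loosely analogous to the mind grasping more dimensions of connection, the average number of steps between any two pieces of information falls:

```python
# Requires: pip install networkx
import networkx as nx

n = 1000  # information nodes
for k in (4, 8, 16):  # links per node in the underlying lattice
    G = nx.connected_watts_strogatz_graph(n=n, k=k, p=0.1, seed=42)
    steps = nx.average_shortest_path_length(G)
    print(f"{k:>2} links per node: ~{steps:.1f} steps between nodes on average")
```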
While normal visual tools can only support two or at most three dimensions, business intelligence tools often try to add an extra one or even two through hierarchies or filters. However, these representations are seldom very successful and are only useful to highly skilled users.
A direct feed into the brain, even if it has to go through the visual nervous system, could provide so much more than a convenient form of today’s Google Glass. Properly trained, the brain will adapt visual data to meet information needs in very efficient and surprising ways that go well beyond in-mind images. It is entirely conceivable that by the late 21st century people will think in a dozen dimensions and navigate terabytes of complex information with ease using an “in-eye” machine connection.
Of course, the implications go well beyond the technology. How will such implants be used by the wider population? Will the desire for the ultimate man-machine interface be so overwhelming as to overcome ethical concerns of operating on otherwise healthy patients?
The implications are also social and will affect parents (perhaps even our children). The training required for the brain could overwhelm many adult minds, but may be most effective when available from an early age. In fact, it could be that children who are implanted with a bionic eye will derive the greatest advantages from the information economy in the latter decades of the 21st century.
Imagine the pressure on parents in future decades to decide which brand of implant to install in their children. It makes today’s iOS, Android and Windows smartphone debates pale into insignificance!
In the era of big data, we’re confronted by the question Brenda Somich recently blogged: How do you handle information overload? “Does today’s super-connected and informative online environment allow us to work to our potential?” Somich asked. “Is all this information really making us smarter?”
I have blogged about how much of the unstructured data that everyone is going gaga over is gigabytes of gossip and yottabytes of yada yada digitized. While most of our verbalized thoughts were always born this way, with word of mouth becoming word of data, big data is making little data monsters of us all.
In a way, we have become addicted to data. In her post, Somich discussed how we have become so obsessed with checking emails, news feeds, blog posts, and social media status updates that, even after hours of consuming information, we are still searching for our next data fix. Our smartphones have become our constant companions, ever-present enablers reminiscent of the nickname that the once most popular smartphone went by — CrackBerry.
In his book Hamlet’s BlackBerry: Building a Good Life in the Digital Age, William Powers explained that “in the sixteenth century, when information was physically piling up everywhere, it was the ability to erase some of it that afforded a sense of empowerment and control.”
“In contrast, the digital information that weighs on us today exists in a nonphysical medium, and this is part of the problem. We know it’s out there, and we have words to represent and quantify it. An exabyte, for instance, is a million million megabytes. But that doesn’t mean much to me. Where is all that data, exactly? It’s everywhere and nowhere at the same time. We’re physical creatures who perceive and know the world through our bodies, yet we now spend much of our time in a universe of disembodied information. It doesn’t live here with us, we just peer at it through a two-dimensional screen. At a very deep level of the consciousness, this is arduous and draining.”
Without question, big data is forcing us to revisit information overload. But sometimes it’s useful to remember that the phrase is over forty years old now — and it originally expressed the concern, not about the increasing amount of information, but about our increasing access to information.
Just because we now have unprecedented access to an unimpeded expansion of information doesn’t mean we need to access it right now. Just because disembodied information is everywhere doesn’t mean that our bodies need to consume it.
One thing we must do, therefore, to avoid such snafus as the haunting hyper-connected hyperbole of the infinite inbox, is acknowledge the infinitesimal value of most of the information we consume.
When you are feeling overwhelmed by the amount of information you have access to, stop for a moment and consider how underwhelming most of it is. I think part of the reason we keep looking for more information is because we’re so unsatisfied with the information we’ve found.
Although information overload is a real concern and does frequently occur, far more often, I think, it is information underwhelm that is dragging us down.
How much of the content of those emails, news feeds, blog posts, and social media status updates you read yesterday, or even earlier today, do you actually remember? If you’re like me, probably not much, which is why we need to mind the gap between our acquisition and application of information.
As Anton Chekhov once said, “knowledge is of no value unless you put it into practice.” By extension, consuming information is of no value unless you put it to use. And an overwhelming amount of the information now available to us is so underwhelming that it’s useless to consume.