Archive for the ‘Information Development’ Category
For years now the industry has confidently predicted that Near Field Communication (NFC) would be the vehicle by which we would all abandon our leather wallets and embrace our smartphones for electronic payments.
It hasn’t happened, and I don’t believe NFC is the solution. Here’s what we can learn from NFC’s failure to take off. As enticing as the idea of using our phones for payments is, there are two problems that the focus on NFC fails to solve.
The first is the barrier of adoption. Until consumers can be confident that all of their purchases can be made with their phone, they don’t dare leave their wallets behind. Similarly, until retailers believe that enough of their consumers aren’t carrying conventional wallets they won’t invest sufficiently to encourage take-up.
The second, and potentially more serious, issue is that the ergonomics are all wrong. NFC payments have been designed to mimic the way retail has been done for centuries, with the consumer picking up products from shelves and taking them to a payment terminal once they’ve made their selection.
Designers of the next generation of payment solutions need to step away from merely facilitating the exchange of currency between consumer and retailer and think about the whole transaction. The highest priority for both shopper and shopkeeper is to rid the store of lines at the checkout and, ideally, eliminate the final step altogether. If they can do away with shoplifting at the same time, then all the better!
While the future of NFC is uncertain, the one technology that we know will have a place in years to come is location services. The ability to know where smartphones, or even wearable devices, are is fundamental to a whole range of new services. Increasingly, knowing that a customer is in a store is enough to establish the beginning of a transaction. It is likely that the early adopters will be in areas such as food service, where cafes will enter a virtual contract with their regular customers.
For electronic wallet adoption to really take off, the supermarkets and department stores need a way to pair the goods with the customer who is picking them up. Ideally a contract will be established when a customer walks out of the store with the product, perhaps making shoplifting a thing of the past!
However, anyone familiar with this problem will immediately identify that the product identification is the real hurdle with this compelling picture. While Radio-Frequency Identification (RFID) is the most obvious approach, the cost of adding a tag to each product on the shelf is simply too great.
Until someone solves the problem of pairing the product with the customer there is little to make smartphone-based wallets really compelling. However the advantage of eliminating the checkout and reducing shrinkage is so large there is a ready market that will encourage rapid innovation.
The surprising innovators to watch are the drug companies who are keen to find sensors to attach to their medicines to track drug absorption and improve inventory management. If they can make technology which is simple and cheap enough to add to a tablet it is likely that it can be printed onto the packaging of consumer products as simply as a barcode is today.
Is contemporary dataviz really new?
Some would argue no. After all, many of the same reporting, business intelligence, and analytics applications also provide at least rudimentary levels of data visualization, and have for some time now. Yes, there are “pure” dataviz tools like Tableau, but clear lines of demarcation among these terms just do not exist; in fact, the lines between terms blur considerably.
But I would argue that modern-day dataviz really is new. This raises the natural question: how is contemporary dataviz fundamentally different from KPIs, dashboards, and other reporting tools?
In short, dataviz is about data exploration and discovery, not traditional reporting. To me, those tried-and-true terms always implied that the organization, department, group, or individual employee knew exactly what to ask and measure. Examples included:
- How many sales per square foot are we seeing?
- What’s employee turnover?
- What’s our return on assets?
These are still important questions, even in an era of Big Data. But contemporary dataviz is less, well, certain. There’s a genuine curiosity at play when you don’t know exactly what you’re looking for, much less what you’ll find.
In keeping with the data discovery theme of this post, why not try to answer my question about dataviz using dataviz? While it’s only a proxy, I find Google Trends to be a very useful tool for answering questions about what’s popular or new, where, when, and how things are changing. For instance, consider the searches taking place on “data visualization” over the past four years throughout the world:
Since I live in the US, I was curious about how my home country broke down. In other words, is dataviz more popular in different parts of the country? With Google Trends, that’s not hard to see:
Note here that new and popular are not necessarily one and the same. Again, this was meant to serve as a proxy–and to illustrate the fact that dataviz doesn’t necessarily lead to a particular next step. I was exploring the data and, if I really wanted, I could keep going.
Data discovery doesn’t necessarily lead to a logical outcome–and that’s fine.
What say you?
Richard Ordowich, commenting on my Hail to the Chiefs post, remarked how “most organizations need to improve their data literacy. Many problems stem from inadequate data definitions, multiple interpretations and understanding about the meanings of data. Skills in semantics, taxonomy and ontology as well as information management are required. These are skills that typically reside in librarians but not CDOs. Perhaps hiring librarians would be better than hiring a CDO.”
I responded that maybe not even librarians can save us by citing The Library of Babel, a short story by Argentine author and librarian Jorge Luis Borges, which is about, as James Gleick explained in his book The Information: A History, A Theory, A Flood, “the mythical library that contains all books, in all languages, books of apology and prophecy, the gospel and the commentary upon that gospel and the commentary upon the commentary upon the gospel, the minutely detailed history of the future, the interpolations of all books in all other books, the faithful catalogue of the library and the innumerable false catalogues. This library (which others call the universe) enshrines all the information. Yet no knowledge can be discovered there, precisely because all knowledge is there, shelved side by side with all falsehood. In the mirrored galleries, on the countless shelves, can be found everything and nothing. There can be no more perfect case of information glut.”
More than a century before the rise of cloud computing and the mobile devices connected to it, the imagination of Charles Babbage foresaw another library of Babel, one where “the air itself is one vast library, on whose pages are forever written all that man has ever said or woman whispered.” In a world where word of mouth has become word of data, sometimes causing panic about who may be listening, Babbage’s vision of a permanent record of every human utterance seems eerily prescient.
Of the cloud, Gleick wrote about how “all that information—all that information capacity—looms over us, not quite visible, not quite tangible, but awfully real; amorphous, spectral; hovering nearby, yet not situated in any one place. Heaven must once have felt this way to the faithful. People talk about shifting their lives to the cloud—their informational lives, at least. You may store photographs in the cloud; Google is putting all the world’s books into the cloud; e-mail passes to and from the cloud and never really leaves the cloud. All traditional ideas of privacy, based on doors and locks, physical remoteness and invisibility, are upended in the cloud.”
“The information produced and consumed by humankind used to vanish,” Gleick concluded, “that was the norm, the default. The sights, the sounds, the songs, the spoken word just melted away. Marks on stone, parchment, and paper were the special case. It did not occur to Sophocles’ audiences that it would be sad for his plays to be lost; they enjoyed the show. Now expectations have inverted. Everything may be recorded and preserved, at least potentially: every musical performance; every crime in a shop, elevator, or city street; every volcano or tsunami on the remotest shore; every card played or piece moved in an online game; every rugby scrum and cricket match. Having a camera at hand is normal, not exceptional; something like 500 billion images were captured in 2010. YouTube was streaming more than a billion videos a day. Most of this is haphazard and unorganized.”
The Library of Babel is no longer fiction. Big Data is the Library of Babel.
Lately I’ve been crafting an ontology that uses English-language prepositions in a manner that seems natural and economical while staying semantically distinguishable. For reference’s sake, let’s call this ontology “Grover”, after the hilarious skits Grover staged on Sesame Street to teach children the nuanced semantic differences among prepositions like “at”, “by”, “on” and all the others (there are only about 100) which glue our nouns together into discrete concepts upon which children sometimes, but not always, willingly act.
I began this effort when I noticed that ontologies were being published that named many of their properties the same as their classes. For instance, the Data Cube Vocabulary I’ve blogged about has a class named “Dataset” and a property named “dataset”. The property “dataset” names, predictably, the “Dataset” class as its range, a pattern that provides little to no “sidecar” information when a concept is labelled with subject, predicate and object. To me, this pattern is rankly duplicative.
A better approach to naming properties, our glue between our common and proper nouns, flows foremost from the requirement that all instances of rdf:Resource be named as type:name; the effects of this requirement on an ontology can be surprisingly profound. To begin with, the pattern above would be aliased to, for instance, is:within, more accurately stating that a given Observation “is:within” a given Dataset. Note that with a property named “dataset” there is little chance to ever exploit its inverse, that is, the mirror property that flows from a dataset to its set of observations.
With Grover, however, the inverse of “is:within” is “has:within” (perhaps unfortunately named, given what a plain statement of fact it is). Consequently a dataset can be queried to determine its set of observations, a seemingly costless improvement to the Data Cube Vocabulary. But it’s costless only to the extent that the resource naming pattern mentioned above is followed: the type:name convention is key to maintaining the only advantage that conventional noun-oriented property names enjoy, namely ensuring that the object of a given property is of one (or more) certain classes, since ontology applications try to keep the ‘range’ of a property consistent throughout their knowledgebase operations.
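As an illustration, the inverse-property behaviour described above can be sketched in a few lines of Python. The triples, the is:/has: prefixed names, and the query helper are all hypothetical, invented for this sketch rather than drawn from any published Grover vocabulary:

```python
# Hypothetical sketch of Grover-style prepositional properties.
# "is:within" and "has:within" are treated as mutual inverses, so a
# dataset can be queried for its observations even though only the
# forward direction is stored.

TRIPLES = [
    ("Observation:obs1", "is:within", "Dataset:sales2013"),
    ("Observation:obs2", "is:within", "Dataset:sales2013"),
    ("Observation:obs3", "is:within", "Dataset:hr2013"),
]

INVERSES = {"is:within": "has:within", "has:within": "is:within"}

def _match(s, p, o, qs, qp, qo):
    return ((qs is None or s == qs) and
            (qp is None or p == qp) and
            (qo is None or o == qo))

def query(subject=None, predicate=None, obj=None):
    """Match stored triples, also following each predicate's inverse."""
    results = []
    for s, p, o in TRIPLES:
        if _match(s, p, o, subject, predicate, obj):
            results.append((s, p, o))
        inv = INVERSES.get(p)
        # The mirror triple (o, inverse(p), s) is implied, not stored.
        if inv and _match(o, inv, s, subject, predicate, obj):
            results.append((o, inv, s))
    return results

# All observations within the sales dataset, via the inverse property:
members = [o for _, _, o in query("Dataset:sales2013", "has:within")]
print(members)  # ['Observation:obs1', 'Observation:obs2']
```

The point of the sketch is that a single INVERSES table makes the “has:within” direction queryable at no storage cost, which is exactly the opportunity a noun-named property like “dataset” forfeits.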
But Grover’s prepositions suggest the Resource Description Framework needs a new mechanism for specifying which pair-wise combinations of domains and ranges are legal as classes for the Subject and Object resources in triples involving any given RDF Property. Otherwise there is no other evident means of specifying the formalities of the implied property/class relationships that exist in a Grover universe.
The type:name convention works because queries can just as easily be written to test (using a LIKE operator) for certain prefixes on a resource’s name, e.g., for Dataset:*. Note, however, the other use of the colon in names, namely for prepositional properties such as is:within. These prefixes are actually XML namespace prefixes, the idea being that verb-based semantic namespace identifiers present several valuable advantages for a highly practical ontology.
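That prefix test can be sketched with plain Python string matching standing in for a SQL LIKE or SPARQL filter (the resource names below are invented for illustration):

```python
# Hypothetical resources following the type:name convention.
resources = [
    "Dataset:sales2013",
    "Dataset:hr2013",
    "Observation:obs1",
    "SliceKey:byRegion",
]

def of_type(names, type_prefix):
    """Return the resources whose name carries the given type prefix,
    the Python analogue of LIKE 'Dataset:%'."""
    return [n for n in names if n.startswith(type_prefix + ":")]

print(of_type(resources, "Dataset"))
# ['Dataset:sales2013', 'Dataset:hr2013']
```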
Ciao until next month!
Recently the W3C issued a Candidate Recommendation that caught my eye: the Data Cube Vocabulary, which claims to be both very general and useful for data sets such as survey data, spreadsheets and OLAP. The Vocabulary is based on SDMX, an ISO standard for exchanging and sharing statistical data and metadata among organizations, so there’s a strong international impetus towards consensus about this important piece of Internet 3.0.
Linked Open Data (LOD) is, as the W3C says, “an approach to publishing data on the web, enabling datasets to be linked together through references to common concepts.” This approach is based on the W3C’s Resource Description Framework (RDF), of course. So the foundational ontology that actually implements this worthwhile but grandiose LOD vision may well be this Data Cube Vocabulary, very handily enabling the exchange of semantically annotated HTML tables. Note that relational government data tables can now be published as “LOD data cubes”, shareable with a public adhering to this new world-standard ontology.
But as the key logical layer in this fast-coming semantic web, Data Cubes may well affect the manner in which an enterprise ontology is designed. Start with the fact that Data Cubes are themselves built upon several more basic RDF-compliant ontologies:
The Data Cube Vocabulary says that every Data Cube is a Dataset that has a Dataset Definition (in effect a “one-off ontology” specification). Any dataset can have many Slices of a metaphorical, multi-dimensional “pie” of the dataset. Within the dataset itself, and within each slice, are unbounded masses of Observations; each observation has values not only for the measured property itself but also for any number of applicable key values. That’s all there is to it, right?
Think of an HTML table. A “data cube” is a “table” element whose columns are “slices” and whose rows are “keys”. “Observations” are the values of the data cells. This much is clear, but now the fun starts with identifying the TEXTUAL VALUES within the data cells of a given table.
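The table-to-cube mapping just described can be sketched as follows. This is a toy table with invented values; real Data Cube publishing would of course emit RDF, not Python dictionaries:

```python
# A small table: the table is the dataset, each year column is a
# slice, and each data cell becomes one observation keyed by its
# row key and column name.
header = ["region", "2012", "2013"]   # first column holds the row keys
rows = [
    ["North", 100, 110],
    ["South",  90, 120],
]

# Flatten the table into observations, one per data cell.
observations = []
for row in rows:
    row_key = row[0]
    for col_name, value in zip(header[1:], row[1:]):
        observations.append({"key": row_key, "slice": col_name, "value": value})

# A "slice" is then simply the set of observations sharing one column key:
slice_2013 = [o["value"] for o in observations if o["slice"] == "2013"]
print(slice_2013)  # [110, 120]
```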
Here is where “Dataset Descriptions” come in — these are associated with an LOD dataset much as a description accompanies an HTML table element. They describe all possible dataset dimension keys and the different kinds of properties that can be named in an Observation. Text attributes, measures, dimensions, and coded properties are all provided, and all are sub-properties of the vocabulary’s generic component property.
This is why a Dataset Description is a “one-off ontology”: it defines only text and numeric properties and, importantly, no classes of functional things. So, with perfect pitch, the Data Cube Vocabulary virtually requires enterprise ontologies to ground their property hierarchies with the measure, dimension, code, and text attribute properties.
Data Cubes define just a handful of classes like “Dataset”, “Slice”, “SliceKey” and “Observation”. How are these four classes best inserted into an enterprise’s ontology class hierarchy? “Observation” is easy: it should be the base class of all observable properties, that is, all and only textual properties. “SliceKey” is a Role that an observable property can play. A “Slice” is basically an annotated rdf:Bag, at times mediated by a skos:Collection.
A “Dataset” is a hazy term applicable to anything classifiable as data objects or as data structures; that is, “a set of data” is merely an aggregate collection of data items, just as a data object or data structure is. Accordingly, a Data Cube “dataset” class might be placed at or near the root of a class hierarchy, but it’s clearer to establish it as a subclass of an aggregate such as rdf:Bag.
There’s more to this topic saved for future entries — all those claimed dependencies need to be examined.
Available on Archive: The Open MIKE Podcast Series
The Open MIKE Podcast is a video podcast show, hosted by Jim Harris, which discusses aspects of the MIKE2.0 framework, and features content contributed to MIKE 2.0 Wiki Articles, Blog Posts, and Discussion Forums.
You can scroll through all 12 of the Open MIKE Podcast episodes below:
Feel free to check them out when you have a moment.
This Week’s Blogs for Thought:
Hail to the Chiefs
The MIKE2.0 wiki defines the Chief Data Officer (CDO) as one who plays a key executive leadership role in driving data strategy, architecture, and governance, serving as the executive leader for data management activities.
“Making the most of a company’s data requires oversight and evangelism at the highest levels of management,” Anthony Goldbloom and Merav Bloch explained in their Harvard Business Review blog post Your C-Suite Needs a Chief Data Officer.
Goldbloom and Bloch describe the CDO as being responsible for identifying how data can be used to support the company’s most important priorities, making sure the company is collecting the right data, and ensuring the company is wired to make data-driven decisions.
Check it out.
Can Big Data Save CMOs?
Executive turnover has always fascinated me, especially as of late. HP’s CEO Leo Apotheker had a very short run and Yahoo! has been a veritable merry-go-round over the last five years. Beyond the CEO level, though, many executive tenures resemble those of Spinal Tap drummers. For instance, CMOs have notoriously short lifespans. While the average tenure of a CMO has increased from 23.6 to 43 months since 2004, it’s still not really a long-term position. And I wonder if Big Data can change that.
The Downside of DataViz
ITWorld recently ran a great article on the perils of data visualization. The piece covers a number of companies, including Carwoo, a startup that aims to make car buying easier. The company has been using dataviz tool Chartio for a few months. From the article:
If any modern economy wants to keep, or even add, value to their country as the digital economy grows it has to search for productivity in new ways. That means bringing together innovations from IT that are outside today’s core and combining them with solutions developed both locally and globally.
Global economies are experiencing a period of rapid change that has arguably not been seen before by anyone less than 80 years of age. Global drivers have been shifting from valuing the making of things to the flow of intellectual capital. This is the shift to an information economy which has most recently been dubbed “digital disruption”.
In just a few short years the digital economy has grown from insignificance to being something that politicians around the world need to pay attention to.
Unfortunately, most governments see the digital economy in terms of the loss of tax revenue from activities performed over the internet rather than understanding the extent of the need to recalibrate their economies to the new reality.
While tax loopholes are always worth pursuing, the real focus should be stepping up to the challenge facing every country around the world: how to keep adding value locally to protect the local economy and jobs. There is absolutely no point, for instance, in complaining that there is less tax on music streaming than on the manufacture, distribution and sale of CDs. The value is just in a different place, and most of it isn’t where it was.
Although the loss of bookstores and music retailers has been one of the most visible changes, the shift in spending has come at an incredible pace. Just ask any postal service or the newspaper industry. As citizens we are keen to take advantage of the advances of smartphones, the integration of supply chains with manufacturing in China (giving us very cheap goods) and the power of social media to share information with our friends. We are less keen when our kids lose their jobs as retail outlets close, manufacturing shuts down and the newsagent’s paper round gets shorter.
We all benefited dramatically as IT improved the efficiency and breadth of services that government and business were able to offer. Arguably, the mass rollout of business IT was as important to productivity in the 1990s as the economic reforms of the 1980s were. As a direct result there are now millions of people employed in IT around the world.
While this huge workforce has been responsible for so much, today it is not being applied enough to protect economies from leaking value offshore. Many companies regard technology as something they do just enough of to get on with their “real” businesses, even as their markets diminish. Even “old economy” businesses need to be encouraging their IT departments to innovate and apply for patents.
To protect their tax base, and future prosperity, each country has to search for productivity in new ways. That means combining innovations from IT that are outside today’s core and combining them with solutions developed by other organisations locally and internationally. It means looking beyond the core business and being prepared to spin-off activities at the edge that have real value in their own right. And it also means governments and large enterprises need to change the way that they procure so that they are seeding whole new economies.
Today when politicians and executives hear about IT projects they don’t think about the productivity gain, they just fear the inevitable delays and potential for bad press. That’s because large organisations have traditionally approached IT as a 1990s procurement problem rather than as an opportunity to seed a new market. A market that is desperately needed by small and medium enterprises who, while innovative, find it very hard to compete for big IT projects leaving many solutions being limited to less innovative and less efficient approaches. Every time this happens the local and global economy takes a small productivity hit which ultimately hurts us all.
Imagine a world of IT where government and large business don’t believe they have to own the systems that provide their citizens and customers with services. This is the economy that cloud computing is making possible, with panels of providers able to deliver everything from fishing licences to payroll for a transactional fee.
Payment for service reduces government and business involvement in the risky business of delivering large scale IT projects while at the same time providing a leg-up for local businesses to become world leaders using local jobs.
Government can have a major impact through policy settings in areas such as employee share schemes, R&D tax credits and access to skilled labour. However, the biggest single impact governments can have on growing their digital economies, and on putting the huge IT workforce to productive work, is through the choices they make in what they buy.
Business can have a major impact on productivity in the short term by better integrating with local and global providers, but repeating the benefits of the 1990s productivity improvements will require a willingness to invent new solutions using the most important tools of our generation: digital and information technology.
“Who, what, and where, by what helpe, and by whose:
Why, how, and when, doe many things disclose.”
- Thomas Wilson, The Arte of Rhetorique, 1560
Our human proclivity for story-telling as the primary method we have to communicate our culture — its values, tales, foibles and limitless possibilities — seems to have been ignored, if not forgotten, by the standards we have for digital communications. Rather than build systems that promote human understanding of, and appreciation for, the complexity of human endeavor, we have instead built a somewhat technocratic structure to transmit ‘facts’ about ‘things’. A consequence of this monomaniacal focus on ‘facts’ is that we risk losing humanity in the thicket of numbers and pseudo-semantic string values we pass around among ourselves.
Let’s instead look at a way to structure our communications that relies on time-tested concepts of ‘best practice’ story-telling. It may seem odd to talk of ‘best practices’ and story-telling in the same breath, though indeed we’re all familiar with these ideas.
The “5 Ws” are a good place to start. These are five (or six, or sometimes more) basic questions whose answers are essential to effective story-telling, often taught to students of journalism, scientific research, and police forensics. Stories which neglect to relay the entirety of the “Who-What-When-Where-Why” (the five Ws) of a topic may too easily leave listeners ‘semi-ignorant’ about it. In a court of law, for instance, trials in which method (how), motive (why) and opportunity (when) are not fully explicated and validated rightly end with verdicts of ‘not guilty’.
The same approach — providing the ‘full story’ about a topic — should undergird our methods of digital communications as well. Perhaps by doing so, much of the current debate about “context” for/of any particular semantic (RDF) dataset might be more tractable and resolvable.
The practical significance of a “5 Ws” approach is that it can directly provide a useful metric about RDF datasets. A low score suggests the dataset makes small effort to give the ‘full story’ about its set of topics, while a high score would indicate good conformance to the precepts of best practice communications. In the real world, for instance, a threshold for this metric could be specified in contracts which envision information exchanged between its parties.
While a high score of course wouldn’t attest to the reliability or consistency of each answer to the “5 Ws” for a given topical datastream, a low score indicates that the ‘speaker’ is merely spouting facts (that is, the RDF approach, which is to “say anything about anything”), best used to accentuate one’s own story but not useful as a complete recounting in its own right.
A “best practice communications” metric might be formulated by examining the nature of the property values associated with a resource. If entities are each a subclass of a 5W class, then it can be a matter of “provisionally checking the box” to the extent that some answer exists: a 100% score might indicate the information about a given topic is nominally complete, while a 0% score indicates that merely a reference to the existence of the resource has been provided.
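As a sketch of that scoring idea, the property-to-W assignments below are invented placeholders, not drawn from any published vocabulary:

```python
# Hypothetical grouping of dataset properties under the five W classes.
FIVE_WS = {
    "who":   {"author", "agent"},
    "what":  {"title", "description"},
    "when":  {"created", "modified"},
    "where": {"location", "source"},
    "why":   {"purpose"},
}

def five_w_score(properties):
    """Fraction of the five Ws for which a resource answers at least
    one property -- the 'provisionally checked box' of the text."""
    answered = sum(1 for props in FIVE_WS.values() if props & set(properties))
    return answered / len(FIVE_WS)

# A resource carrying only a "what" and a "when" answer scores 40%:
print(five_w_score({"title", "created"}))  # 0.4
```

A contract between data-exchanging parties could then, as the text suggests, simply specify a minimum threshold for this score.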
Viewing semantic datastreams each as a highly formalized story (or set of stories), then applying quality criteria developed by professional communicators as long ago as Cicero, can provide valuable insights when building high quality data models and transmitting high quality datastreams.
Like many, I’m one who’s been around since the cinder block days, once entranced by shiny Tektronix tubes stationed near a dusty card sorter. After years using languages ranging from Assembler to Scheme, I’ve come to believe the shift these represented, from procedural to declarative, has greatly improved the flexibility of the software organizations produce.
Interest has now moved towards an equally flexible representation of data. In the ‘old’ days, when an organization wanted to collect a new data-item about, say, a Person, a new column would first have to be added by a friendly database administrator to a Person Table in one’s relational database. Very inflexible.
The alternative — now widely adopted — reduces databases to a simple formulation, one that eliminates Person and other entity-specific tables altogether. These “triple-stores” basically have just three columns — Subject, Predicate and Object — in which all data is stored. Triple-stores are often called ‘self-referential’ because, first, the type of the Subject of any row in a triple-store is found in a different row (not column) of the triple-store and, second, definitions of types are themselves found in other rows of the triple-store. The benefits? Not only is the underlying structure of a triple-store unchanging, but stand-alone metadata tables (tables describing tables) also become unnecessary.
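A minimal sketch of such a triple-store, with invented names, shows the self-referential storage of types as ordinary rows:

```python
# One list of (subject, predicate, object) rows holds everything.
# Note that the type of "alice" and the definition of the Person type
# are both ordinary rows -- the 'self-referential' property above.
store = [
    ("alice",  "rdf:type",     "Person"),
    ("alice",  "hasEmail",     "alice@example.com"),
    ("Person", "rdf:type",     "rdfs:Class"),
    ("Person", "rdfs:comment", "A human being"),
]

def objects(store, subject, predicate):
    """All object values for a given subject/predicate pair."""
    return [o for s, p, o in store if s == subject and p == predicate]

# Collecting a new data-item needs no schema change -- just another row:
store.append(("alice", "hasPhone", "555-0100"))

print(objects(store, "alice", "rdf:type"))   # ['Person']
print(objects(store, "Person", "rdf:type"))  # ['rdfs:Class']
```

Contrast this with the relational approach, where the new phone number would have required a database administrator to alter the Person table.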
Why? Static relational database tables do work well enough to handle transactional records whose data items are usually well known in advance; the rate of change in those business processes is fairly low, so the cost of database architectures based on SQL tables is equally low. What, then, is driving the adoption of triple-stores?
The scope of business functions organizations seek to automate has enlarged considerably: the source of new information within an organization is less frequently “forms” completed by users and more frequently raw text from documents, tweets, blogs, emails, newsfeeds, and other ‘social’ web and internal sources that organizations have produced, received, or retrieved.
Semantic technologies are essential components of Natural Language Processing (NLP) applications which extract and convert, for instance, all proper nouns within a text into harvestable networks of “information nodes” found in a triple-store. In fact during such harvesting, context becomes a crucial variable that can change with each sentence analyzed from the text.
This brings us to my primary distinction between really semantic and non-semantic applications: really semantic applications mimic a human conversation, where the knowledge of an individual in a conversation is the result of a continuous accrual of context-specific facts, context-specific definitions, even context-specific contexts. As a direct analogy, Wittgenstein, a modern giant of philosophy, called this phenomenon Language Games, to connote that one’s techniques and strategies for analyzing a game’s state, and one’s actions, are not derivable in advance; they emerge only during the play of the game, i.e., during processing of the text corpora.
Non-semantic applications on the other hand, are more similar to rites, where all operative dialogs are pre-written, memorized, and repeated endlessly.
This analogy to human conversations (to ‘dynamic semantics’) is hardly trivial; it is a dominant modelling technique among ontologists as evidenced by development of, for instance, Discourse Representation Theory (among others, e.g., legal communities have a similar theory, simply called Argumentation) whose rules are used to build Discourse Representation Structures from a stream of sentences that accommodate a variety of linguistic issues including plurals, tense, aspect, generalized quantifiers, anaphora and others.
“Semantic models” are an important path towards a more complete understanding of how humans, when armed with language, are able to reason and draw conclusions about the world. Relational tables, however, in themselves haven’t provided similar insight or re-purposing in different contexts. This fact alone is strong evidence that semantic methods and tools must be prominent in any organization’s technology plans.
Content Contributor Contest
Hey members! Ready to showcase your IM skills? Contribute a new wiki article or expand an existing article between now and September 30, 2013 and you’ll be automatically entered into a drawing for a free iPad mini!
Contest Terms and Conditions:
Eligibility is open to any new or existing MIKE2.0 member who makes a significant wiki article expansion (expansion must be greater than half the existing wiki article) or creates a new article. Please note that new articles MUST be related to our core IM competencies such as Business Intelligence, Agile Development, Big Data, Information Governance, etc. Off-topic, promotional and spam articles will be automatically disqualified.
Drawing to take place on October 1, 2013. Winning community member will be notified at the email address listed on their user profile. Contest not open to existing MGA board members, MIKE2.0 team, or contractors.
Need somewhere to start?
How about the [most wanted pages], or the pages we know [need more work], or even the [stub] articles that somebody else has started but hasn’t been able to finish? Still not sure? Just contact us at email@example.com and we’ll be happy to refer you to some articles that might be a good fit for your expertise.