Archive for the ‘Information Development’ Category
It’s fashionable to claim that you’ve moved everything, from your email to your enterprise applications, “into the cloud”. But what about your data? Just because information is stored over the Internet doesn’t necessarily mean it qualifies as being “in the cloud”.
New cloud solutions are appearing at an incredible rate. From productivity to consumer applications, the innovations are staggering. However, there is a world of difference in the ways that the data behind them is being managed.
The best services treat the application separately from the data that supports it and make the content easily available from outside the application. Unfortunately there is still a long way to go before the aspirations of information-driven businesses can be met by the majority of cloud services, which continue to lock away the content and keep the underlying models close to their chests.
An illustrative example of best practice is a simple drawing solution, Draw.io. Draw.io is a serious threat to products that support the development of diagrams of many kinds. It avoids any ownership of the diagrams by reading and saving XML and leaving it to the user to decide where to put the content, while making it particularly easy to integrate with your Google Drive or Dropbox account, keeping the content both in the cloud and under your control. This separation is becoming much more common, with cloud providers likely to bifurcate between the application and data layers.
You can see Draw.io in action as part of a new solution for entity-relationship diagrams in the tools section of www.infodrivenbusiness.com.
Offering increasing sophistication in data storage are the fully integrated solutions such as Workday, Salesforce.com and the cloud offerings of the traditional enterprise software companies such as Oracle and SAP. These vendors are realising that they need to work seamlessly with other enterprise solutions either directly or through third-party integration tools.
Also important to watch are the offerings from Microsoft, Apple and Google which provide application logic as well as facilitating third-party access to cloud storage, but lead you strongly (and sometimes exclusively) towards their own products.
There are five rules I propose for putting data in the cloud:
1. Everyone should be able to collaborate on the content at the same time
To be in the cloud, it isn’t enough to back up the data on your hard disk drive to an Internet server. While obvious, this is a challenge to solutions that claim to be cloud offerings but have simply moved existing file and database storage to a remote location. Many cloud providers are now offering APIs to make it easy for application developers to offer solutions with collaboration built in.
2. Data and logic are separated
Just like the rise of client/server architectures in the 1990s, cloud solutions are increasingly separating the tiers of their architecture. This is where published models and the ability to store content in any location are a real advantage. Ideally the data can be moved as needed, providing an added degree of flexibility and the ability to comply with different jurisdictional requirements.
3. The data is available to other applications regardless of vendor
Applications shouldn’t be a black box. The trend towards separating the data from the business logic leads inexorably towards open access to the data by different cloud services. Market forces are also leading towards open APIs and even published models.
4. The data is secure
The content not only needs to be secure, but it also needs to be seen to be secure. Ideally it is only visible to the owner of the content and not the cloud application or storage vendor. This is where those vendors offering solutions that separate application logic and storage have an advantage given that much of the security is in the control of the buyer of the service.
5. The data remains yours
I’ve written about data ownership before (see You should own your own data). This is just as important regardless of whether the cloud solution is supporting a consumer, a business or government.
Over the course of my career, I have written more reports than I can count. I’ve created myriad dashboards, databases, SQL queries, ETL tools, neat Microsoft Excel VBA, scripts, routines, and other ways to pull and massage data.
In a way, I am Big Data.
This doesn’t make me special. It just makes me a seasoned data-management professional. If you’re reading this post, odds are that the list above resonates with you.
Three Problems with Creating Excessive Reports
As an experienced report writer, I don’t find it terribly hard to pull data from database tables, external sources, and the web. There’s no shortage of forums, bulletin boards, wikis, websites, and communities devoted to the most esoteric of data- and report-related concerns. Google is a wonderful thing.
I’ve made a great deal of money in my career by doing as I was told. That is, a client would need me to create ten reports and I would dutifully create them. Sometimes, though, I would sense that ten weren’t really needed. I would then ask if any reports could be combined. What if I could build only six or eight reports to give that client the same information? What if I could write a single report with multiple output options?
There are three main problems with creating an excessive number of discrete reports. First, it encourages a rigid mode of thinking, as in: “I’ll only see it if it’s on the XYZ report.” For instance, Betty in Accounts Receivable runs a past-due report to find vendors who are more than 60 days late with their payments. While this report may be helpful, it will fail to include any data that does not meet its predefined criteria. Perhaps her employer is especially concerned about invoices from particularly shady vendors that are only 30 days past due.
Second, there’s usually a great deal of overlap. Organizations with hundreds of standard reports typically use multiple versions of the same report. If you ran a “metareport”, I’d bet that some duplicates would appear. In and of itself, this isn’t a huge problem. But often a database change means effectively modifying the same report multiple times.
Third, and most important these days, the reliance upon standard reports inhibits data discovery.
Look, standard reports aren’t going anywhere. Simple lists and financial statements are invaluable for millions of organizations.
At the same time, though, one massive report for everything is less than ideal. Ditto for a “master” set of reports. These days, true data discovery tools like Tableau increase the odds of finding needles in haystacks.
Why not add interactivity to basic reports to allow non-technical personnel to do more with the same tools?
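The rigidity described above can be loosened by parameterizing a single report rather than maintaining many fixed variants. Here is a hypothetical sketch using Python’s built-in sqlite3; the table, vendor names, and thresholds are invented purely for illustration:

```python
import sqlite3
from datetime import date, timedelta

# Invented schema and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (vendor TEXT, due_date TEXT, paid INTEGER)")
today = date(2014, 7, 25)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [
        ("Acme Corp", (today - timedelta(days=75)).isoformat(), 0),
        ("Shady Vendor Ltd", (today - timedelta(days=45)).isoformat(), 0),
        ("Prompt Payers Inc", (today - timedelta(days=10)).isoformat(), 0),
    ],
)

def past_due_report(conn, as_of, days_late):
    """One parameterized report replaces a family of fixed-threshold reports."""
    cutoff = (as_of - timedelta(days=days_late)).isoformat()
    cur = conn.execute(
        "SELECT vendor FROM invoices WHERE paid = 0 AND due_date <= ? ORDER BY vendor",
        (cutoff,),
    )
    return [vendor for (vendor,) in cur.fetchall()]

# The standard 60-day report misses the 45-day-late vendor...
print(past_due_report(conn, today, 60))  # ['Acme Corp']
# ...but the same report, re-parameterized, catches it.
print(past_due_report(conn, today, 30))  # ['Acme Corp', 'Shady Vendor Ltd']
```

The point is not the SQL itself but the design: one report with a threshold parameter covers both the standard 60-day case and the 30-day shady-vendor case.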
What say you?
Big Data requires million-dollar investments.
Nonsense. That notion is just plain wrong. Long gone are the days in which organizations need to purchase expensive hardware and software, hire consultants, and then three years later start to use it. Sure, you can still go on-premise, but for many companies cloud computing, open source tools like Hadoop, and SaaS have changed the game.
But let’s drill down a bit. How can an organization get going with Big Data quickly and inexpensively? The short answer is, of course, that it depends. But here are three trends and technologies driving the diverse state of Big Data adoption.
Crowdsourcing and Gamification
Consider Kaggle. Founded in April 2010 by Anthony Goldbloom and Jeremy Howard, the company seeks to make data science a sport, and an affordable one at that. Kaggle is equal parts crowdsourcing company, social network, wiki, gamification site, and job board (like Monster or Dice).
Kaggle is a mesmerizing amalgam of a company, one that in many ways defies business convention. Anyone can post a data project by selecting an industry, type (public or private), type of participation (team or individual), reward amount, and timetable. Kaggle lets you easily put data scientists to work for you, and renting them is much less expensive than buying them.
Open Source Applications
But that’s just one way to do Big Data in a relatively inexpensive manner, at least compared to building everything from scratch and hiring a slew of data scientists. As I wrote in Too Big to Ignore, digital advertising company Quantcast attacked Big Data in a very different way, forking the Hadoop file system. This required a much larger financial commitment than just running a contest on Kaggle.
The common thread: Quantcast’s valuation is nowhere near that of Facebook, Twitter, et al. The company employs dozens of people–not thousands.
Finally, even large organizations with billion-dollar budgets can save a great deal of money on the Big Data front. Consider NASA, nowhere close to anyone’s definition of small. NASA embraces open innovation, running contests on Innocentive to find low-cost solutions to thorny data issues. NASA often offers prizes in the thousands of dollars, receiving suggestions and solutions from all over the globe.
I’ve said this many times. There’s no one “right” way to do Big Data. Budgets, current employee skills, timeframes, privacy and regulatory concerns, and other factors should drive an organization’s direction and choice of technologies.
What say you?
Few computing and technological achievements rival IBM’s Watson. Its impressive accomplishments to this point include a high-profile victory on Jeopardy!, following IBM’s earlier chess triumph with Deep Blue.
Turns out that we ain’t seen nothin’ yet. Its next incarnation will be much more developer-friendly. From a recent GigaOM piece:
Developers who want to incorporate Watson’s ability to understand natural language and provide answers need only have their applications make a REST API call to IBM’s new Watson Developers Cloud. “It doesn’t require that you understand anything about machine learning other than the need to provide training data,” Rob High, IBM’s CTO for Watson, said in a recent interview about the new platform.
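As a rough illustration of what such a call might look like, here is a hedged sketch using only the Python standard library. The endpoint URL, paths, and JSON fields below are invented placeholders, not IBM’s actual Watson API, which is defined by IBM’s own documentation:

```python
import json
from urllib import request  # stdlib; a production client might use `requests`

# Hypothetical base URL; the real Watson Developers Cloud endpoints are
# defined in IBM's documentation.
API_BASE = "https://example-gateway.ibm.test/watson/v1"

def build_question(text, evidence_items=3):
    """Assemble an illustrative JSON body for a natural-language question."""
    return {"question": {"questionText": text,
                         "evidenceRequest": {"items": evidence_items}}}

def build_request(text):
    """Construct (but do not send) a POST request to the hypothetical endpoint."""
    body = json.dumps(build_question(text)).encode("utf-8")
    return request.Request(
        API_BASE + "/question",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("What vaccinations do I need for Peru?")
print(req.get_method(), req.full_url)
```

The appeal of the platform is exactly this shape: the developer supplies training data and a question over HTTP, and the machine-learning machinery stays on IBM’s side of the API.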
The rationale to embrace platform thinking is as follows: As impressive as Watson is, even an organization as large as IBM (with over 400,000 employees) does not hold a monopoly on smart people. Platforms and ecosystems can take Watson in myriad directions, many of which you and I can’t even anticipate. Innovation is externalized to some extent. (If you’re a developer curious to get started, knock yourself out.)
Continue reading the article and you’ll see that Watson 2.0 “ships” not only with an API, but an SDK, an app store, and a data marketplace. That is, the more data Watson has, the more it can learn. Can someone say network effect?
Think about it for a minute. A data marketplace? Really? Doesn’t information really want to be free?
Well, yes and no. There’s no dearth of open data on the Internet, a trend that shows no signs of abating. But let’s not overdo it. The success of Kaggle has shown that thousands of organizations are willing to pay handsomely for data that solves important business problems, especially if that data is timely, accurate, and aggregated well. As a result, data marketplaces are becoming increasingly important and profitable.
Simon Says: Embrace Data and Platform Thinking
The market for data is nothing short of vibrant. Big Data has arrived, but not all data is open, public, free, and usable.
Combine the explosion of data with platform thinking. It’s not just about the smart cookies who work for you. There’s no shortage of ways to embrace platforms and ecosystems, even if you’re a mature company. Don’t just look inside your organization’s walls for answers to vexing questions. Look outside. You just might be amazed at what you’ll find.
What say you?
In 1928, the physicist Paul Dirac, while attempting to describe the electron in quantum mechanical terms, posited the theoretical existence of the positron, a particle with all the electron’s properties but of opposite charge. In 1932, the experiments of physicist Carl David Anderson confirmed the positron’s existence, a discovery for which he was awarded the 1936 Nobel Prize in Physics.
“If you had asked Dirac or Anderson what the possible applications of their studies were,” Stuart Firestein wrote in his 2012 book Ignorance: How It Drives Science, “they would surely have said their research was aimed simply at understanding the fundamental nature of matter and energy in the universe and that applications were unlikely.”
Nonetheless, 40 years later a practical application of the positron became a part of one of the most important diagnostic and research instruments in modern medicine when, in the late 1970s, biophysicists and engineers developed the first positron emission tomography (PET) scanner.
“Of course, a great deal of additional research went into this as well,” Firestein explained, “but only part of it was directed specifically at making this machine. Methods of tomography, an imaging technique, some new chemistry to prepare solutions that would produce positrons, and advances in computer technology and programming—all of these led in the most indirect and fundamentally unpredictable ways to the PET scanner at your local hospital. The point is that this purpose could never have been imagined even by as clever a fellow as Paul Dirac.”
This story came to mind since it’s that time of year when we try to predict what will happen next year.
“We make prediction more difficult because our immediate tendency is to imagine the new thing doing an old job better,” explained Kevin Kelly in his 2010 book What Technology Wants. Which is why the first cars were called horseless carriages and the first cellphones were called wireless telephones. But as cars advanced we imagined more than transportation without horses, and as cellphones advanced we imagined more than making phone calls without wires. The latest generation of cellphones are now called smartphones and cellphone technology has become a part of a mobile computing platform.
IDC predicts 2014 will accelerate the IT transition to the emerging platform (what they call the 3rd Platform) for growth and innovation built on the technology pillars of mobile computing, cloud services, big data analytics, and social networking. IDC predicts the 3rd Platform will continue to expand beyond smartphones, tablets, and PCs to the Internet of Things.
Among its 2014 predictions, Gartner included the Internet of Everything, explaining how the Internet is expanding beyond PCs and mobile devices into enterprise assets such as field equipment, and consumer items such as cars and televisions. According to Gartner, the combination of data streams and services created by digitizing everything creates four basic usage models (Manage, Monetize, Operate, Extend) that can be applied to any of the four internets (People, Things, Information, Places).
These and other predictions for the new year point toward a convergence of emerging technologies, their continued disruption of longstanding business models, and the new business opportunities that they will create. While this is undoubtedly true, it’s also true that, much like the indirect and unpredictable paths that led to the PET scanner, emerging technologies will follow indirect and unpredictable paths to applications as far beyond our current imagination as a practical application of a positron was beyond the imagination of Dirac and Anderson.
“The predictability of most new things is very low,” Kelly cautioned. “William Sturgeon, the discoverer of electromagnetism, did not predict electric motors. Philo Farnsworth did not imagine the television culture that would burst forth from his cathode-ray tube. Advertisers at the beginning of the last century pitched the telephone as if it was simply a more convenient telegraph.”
“Technologies shift as they thrive,” Kelly concluded. “They are remade as they are used. They unleash second- and third-order consequences as they disseminate. And almost always, they bring completely unpredicted effects as they near ubiquity.”
It’s easy to predict that mobile, cloud, social, and big data analytical technologies will near ubiquity in 2014. However, the effects of their ubiquity may be fundamentally unpredictable. One unpredicted effect that we all became painfully aware of in 2013 was the surveillance culture that burst forth from our self-surrendered privacy, which now hangs the Data of Damocles wirelessly over our heads.
Of course, not all of the unpredictable effects will be negative. Much like the positive charge of the positron powering the positive effect that the PET scanner has had on healthcare, we should charge positively into the new year. Here’s to hoping that 2014 is a happy and healthy new year for us all.
Guiding Principles for the Open Semantic Enterprise
We’ve recently released the seventh episode of our Open MIKE Podcast series.
Episode 07: “Guiding Principles for the Open Semantic Enterprise” features key aspects of the following MIKE2.0 solution offerings:
Semantic Enterprise Guiding Principles: openmethodology.org/wiki/Guiding_Principles_for_the_Open_Semantic_Enterprise
Semantic Enterprise Composite Offering: openmethodology.org/wiki/Semantic_Enterprise_Composite_Offering
Semantic Enterprise Wiki Category: openmethodology.org/wiki/Category:Semantic_Enterprise
The Open MIKE Podcast is a video podcast show, hosted by Jim Harris, which discusses aspects of the MIKE2.0 framework and features content contributed to MIKE2.0 wiki articles, blog posts, and discussion forums.
We kindly invite any existing MIKE contributors to contact us if they’d like to contribute any audio or video segments for future episodes.
Check it out, and watch our community overview video for more information on how to get involved with MIKE2.0.
On Twitter? Contribute and follow the discussion via the #MIKEPodcast hashtag.
This Week’s Blogs for Thought:
RDF PROVenance Ontology
This week features a special 3-part focus on the RDF PROVenance Ontology, a Recommendation from the World Wide Web Consortium that is expected to become a dominant interchange standard for enterprise metadata in the immediate future, replacing the Dublin Core. Following an introduction, reviews of and commentary on the ontology’s RDF classes and RDF properties should stimulate lively discussion of the report’s conclusions.
Has your data quality been naughty or nice?
Whether you celebrate it as a holy day, holiday, or just a day off, Christmas is one of the most anticipated days of the year. ‘Tis the season for traditions, including seasonal blog posts, such as The Twelve (Data) Days of Christmas by Alan D. Duncan, A Christmas Data Carol by Nicola Askham, and an exposé OMG: Santa is Fake by North Pole investigative blogger Henrik Liliendahl Sørensen.
The 12 (Data) Days of Christmas
For my first blog post on MIKE2, and in the spirit of the season, here’s a light-hearted take on some of the challenges you might be facing if you’re dealing with increased volumes of data in the run up to Christmas… On the first day of Christmas, my true love sent to me a ragged hierarchy. On the second day of Christmas, my true love sent to me two empty files and a ragged hierarchy. On the third day of Christmas, my true love sent to me three null values, two missing files and a ragged hierarchy. Read more.
The previous article looked at some of the hierarchical properties in the notational syntax proposed by the W3C’s RDF PROVenance Ontology. This entry now reviews its class hierarchy.
The PROVenance Ontology’s base model is composed of three classes, Entity-Agent-Activity, which echoes W. McCarthy’s Resource-Event-Agent (REA) model for financial accounting transactions. This is pretty nice, since the REA model has a solid pedigree: it is foundational to OASIS, OMG and UN electronic commerce standards and is now headed towards becoming an ISO specification, one which may soon include long-overdue accounting of economic transaction externalities (hooray!).
In a linguistic framework, both base models might be viewed as subsets of a higher-order model in which the essential journalistic questions (who, what, when, which, why, how and how much) are root concepts. The PROVenance Ontology’s second tier of classes does, however, include the prov:Location element, “an identifiable geographic place (ISO 19112), but it can also be a non-geographic place”. So although the PROVenance Ontology doesn’t yet establish a foundation mature enough for modelling arbitrary information, it is well on its way.
Unfortunately, a capability to specify provenance for any (text or resource) attribute value, arguably a fundamental requirement, currently seems to be missing (or at least hard to detect). One might think that a prov:Statement element, a subclass of rdf:Statement, would be easy to annotate with reference(s) to a prov:Bundle in which the details about provenance-significant events reside. But with the present PROV ontology there is no prov:Provenance class, which for an operation of this nature makes the standard more difficult to grasp and implement.
This has perhaps come about because, as the W3C’s Steering Group noted, there is a functional ambiguity between present metadata and historical provenance data on the one hand and a resource’s own functional data on the other, particularly for generation- and maintenance-related activities. Functional elements in this context are specific, hierarchically-organized sets of formal (and informal) agents, actions, dates, resources and usages; these “influence” to some degree the identity of a resource. The result is a set of event-type classes that can equally be used in non-provenance contexts.
Perhaps an area needing more work is the extent to which the PROVenance Ontology leverages the Simple Knowledge Organization System (SKOS), and for that matter the RDF Data Cube Ontology (Qb), both important recent members of the RDF family of W3C Recommendations. In the latter case, Linked Open Data woven into the interchange format for provenance information might increase the number of tools and the uptake of the ontology in the years ahead. Given the PROV Ontology’s focus on the use case of scientific data, there is a golden opportunity to show provenance markup interacting with Data Cube concepts such as qb:Observation. In a similar manner, it would be useful to state a relationship between qb:Observation and prov:Entity. Alternatively, the PROV Ontology could functionally rely on a taxonomy of activities rather than a set of classes. This approach would result in a smaller and more appealingly under-specified ontology: not specifying properties that semantically distinguish one activity class from another suggests that a taxonomy would be just as effective.
Capturing the “influences” upon a resource, a segue to ‘trust’ processes, is handled in the PROV ontology by three classes: prov:EntityInfluence, prov:ActivityInfluence and prov:AgentInfluence. Another approach might define less strictly what an “influential resource” is, and use a class like prov:Influential that follows a facet-oriented design pattern. In this way, the facet can be applied to anything as a categorical tag, permitting influential relationships with things otherwise not influential in their own right. Currently the prov:Influence base class has three properties, only one of which properly relates to an “Influence”.
In summary, the class structure presented by the PROVenance Ontology remains to be tested in practice. The simplicity of the Dublin Core “provenance model” has not yet been replicated by the PROVenance Ontology (in fact, it can’t be considered ‘usable’ in any sense until the Dublin Core ontology has been mapped). The premise of RDF, being able to say anything about anything, seems problematic for provenance statements: one can only make provenance statements about instances of prov:Entity. Should this restriction be re-examined, semantic enterprises may more fruitfully detect, exchange and exploit provenance metadata.
 “Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance” -[Provenance WG]
In a previous entry about the World Wide Web Consortium (W3C)’s RDF PROVenance Ontology, I mentioned that it includes a notation aimed at human consumption. Wow, I thought, that’s a completely new architectural thrust by the W3C: until now the W3C has published only notations aimed at computer consumption, and now it is going to be promoting a “notation aimed at human consumption”. So here are examples of what’s being proposed.
 entity(e1)
 activity(a2, 2011-11-16T16:00:00, 2011-11-16T16:00:01)
 wasDerivedFrom(e2, e1, a, g2, u1)
The first statement declares that an entity named “e1” exists. This could presumably have been “book(e1)”, since any subclass of prov:Entity can be referenced instead. Note: the prov:entity property and the prov:Entity class are top concepts in the PROV ontology.
The second statement should be read as “activity a2, which occurred between 2011-11-16T16:00:00 and 2011-11-16T16:00:01”. An ‘activity’ is a sub-property of prov:influence, just as an ‘Activity’ is a sub-class of prov:Influence, both top concepts in the PROV ontology.
In the third statement, additional details are shown for a “wasDerivedFrom” event (a sub-property of prov:wasInfluencedBy, a top concept of the PROV ontology); to wit, the activity a, the generation g2, and the usage u1 resources are each related to the “wasDerivedFrom” relation.
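Statements of this shape, a name followed by comma-separated arguments, can be parsed mechanically. Below is a minimal, illustrative Python sketch, not a full parser for the W3C notation (whose real grammar also allows optional identifiers and attribute maps):

```python
import re

# A statement is a bare name followed by a parenthesized, comma-separated
# argument list, e.g. wasDerivedFrom(e2, e1, a, g2, u1).
STATEMENT = re.compile(r"^\s*(\w+)\(([^)]*)\)\s*$")

def parse_statement(line):
    """Split 'name(arg1, arg2, ...)' into (name, [args])."""
    m = STATEMENT.match(line)
    if m is None:
        raise ValueError(f"not a recognised statement: {line!r}")
    name, args = m.group(1), m.group(2).strip()
    return name, ([a.strip() for a in args.split(",")] if args else [])

print(parse_statement("entity(e1)"))
# ('entity', ['e1'])
print(parse_statement("wasDerivedFrom(e2, e1, a, g2, u1)"))
# ('wasDerivedFrom', ['e2', 'e1', 'a', 'g2', 'u1'])
```

Note that this naive split on commas is exactly why the essay later observes that commas inside resource names cause problems in this notation.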
The W3C syntax above is a giant step towards establishing standard mechanisms for semantic notations. I’m sure, though, that this approach doesn’t yet qualify as an efficient user-level syntax for semantic notation. First, the vocabulary of properties essentially mirrors the vocabulary of classes; from a KISS view this ontology design pattern imposes a useless burden on the user, who must learn two vocabularies of similar, hierarchical names, one for classes and one for properties. Secondly, camel-case conventions are surely not amenable to most users (really, who wants to promote poor spelling?). Thirdly, attribute values are not labelled classically (“type:name”); treating resource names opaquely wastes an obvious opportunity for clarifications and discoveries of subjects’ own patterns of information, as well as incidental annotations not made elsewhere. Finally, a small point: commas are not infrequently found in names of resources, causing problems in this context.
Another approach is to rearrange the strings in the notations above to achieve a more natural reading for average English speakers. Using a consistent framework of verbs and prepositions for properties named verb:preposition, an approach introduced in an earlier entry, yields an intuitively more interesting syntax with possibilities for future expansion.
 has:this(Entity:e1)
 has:this(Activity:a2; Timestamp:2011-11-16T16:00:00; Timestamp:2011-11-16T16:00:01)
 was:from(Source:e2; Event:e1; Activity:a; Generation:g2; Usage:u1)
 was:from(Source:e1; Source:e2; Source:e3)
The first statement declares that an annotation for a specific page (a subject) has a certain entity named e1, which may or may not exist (that is, be de-referenceable); e1 is qualified as being of type “Entity”. By convention, the first value in the set of values provided to a “property function” is the target of the namespaced relation “has:this”, with the subject resource being the resource that is being semantically annotated. Each attribute value associated with the relation is qualified by the type of value that it is.
In the third statement, the property wasDerivedFrom becomes a relation whose target resource is to be interpreted as a “Source” kind-of-thing, i.e., a “role”. This relation shows four related (perhaps influential) resources.
The last statement is a list of attribute values acceptable in this notation, overloading the ‘was:from’ predicate function for a less tiresome syntax.
The chief advantages of this approach to semantic notations, in comparison with the current W3C recommendation, are: first, it eliminates the need for dual (dueling) hierarchies by adopting a fixed number of prepositional properties; second, it is formally more complete in a lexical sense, yet ripe with opportunities for improved semantic markup, in part through its requirement to type strings; lastly, it is intuitively clearer to the average user, perhaps leading to a more conducive process for semantic annotations.
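The typed-value notation proposed above can also be parsed mechanically. The sketch below is illustrative only; the verb:preposition scheme is this essay’s proposal, not a W3C standard, so the grammar is inferred from the examples (semicolon-separated values, each explicitly typed as “Type:name”):

```python
import re

# A property function is "verb:preposition(Type:name; Type:name; ...)".
STATEMENT = re.compile(r"^\s*(\w+:\w+)\((.*)\)\s*$")

def parse_property_function(line):
    """Split a verb:preposition statement into (property, [(type, name), ...])."""
    m = STATEMENT.match(line)
    if m is None:
        raise ValueError(f"not a property function: {line!r}")
    prop, body = m.groups()
    values = []
    for part in body.split(";"):
        # partition on the FIRST colon, so timestamp values keep their colons
        vtype, _, name = part.strip().partition(":")
        values.append((vtype, name))
    return prop, values

prop, values = parse_property_function(
    "was:from(Source:e2; Event:e1; Activity:a; Generation:g2; Usage:u1)")
print(prop)       # was:from
print(values[0])  # ('Source', 'e2')
```

Because every value carries its type label, the parser recovers roles (“Source”, “Generation”, “Usage”) for free, which is exactly the clarification the opaque W3C resource names give up.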
For my first blog post on MIKE2, and in the spirit of the season, here’s a light-hearted take on some of the challenges you might be facing if you’re dealing with increased volumes of data in the run up to Christmas…
On the first day of Christmas, my true love sent to me a ragged hierarchy.
On the second day of Christmas, my true love sent to me two empty files and a ragged hierarchy.
On the third day of Christmas, my true love sent to me three null values, two missing files and a ragged hierarchy.
On the fourth day of Christmas, my true love sent to me four audit logs, three null values, two missing files and a ragged hierarchy.
On the fifth day of Christmas, my true love sent to me five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.
On the sixth day of Christmas, my true love sent to me six late arrivals, five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.
On the seventh day of Christmas, my true love sent to me seven free-text columns, six late arrivals, five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.
On the eighth day of Christmas, my true love sent to me eight mis-spelled surnames, seven free-text columns, six late arrivals, five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.
On the ninth day of Christmas, my true love sent to me nine duplications, eight mis-spelled surnames, seven free-text columns, six late arrivals, five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.
On the tenth day of Christmas, my true love sent to me ten coding errors, nine duplications, eight mis-spelled surnames, seven free-text columns, six late arrivals, five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.
On the eleventh day of Christmas, my true love sent to me eleven double meanings, ten coding errors, nine duplications, eight mis-spelled surnames, seven free-text columns, six late arrivals, five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.
On the twelfth day of Christmas, my true love sent to me twelve domain outliers, eleven double meanings, ten coding errors, nine duplications, eight mis-spelled surnames, seven free-text columns, six late arrivals, five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.
“At the toolbar (menu, whatever) associated with a document there is a button marked ‘Oh, yeah?’. You press it when you lose that feeling of trust. It says to the Web, ‘so how do I know I can trust this information?’. The software then goes directly or indirectly back to metainformation about the document, which suggests a number of reasons.” [Tim Berners-Lee, 1997]
“The problem is – and this is true of books and every other medium – we don’t know whether the information we find [on the Web] is accurate or not. We don’t necessarily know what its provenance is.” – Vint Cerf
The World Wide Web Consortium (W3C) hit another home run when the RDF PROVenance Ontology officially became a Recommendation within the Resource Description Framework family last May. This timely publication proposes a data model well-suited to its task: representing provenance metadata about any resource. Provenance data for a thing relates directly to its chain of ownership, its development or treatment as a managed resource, and its intended uses and audiences. Provenance data is a central requirement for any trust-ranking process, which is often applied to digital resources sourced from outside an organization.
The PROV Ontology is bound to have important impacts on existing provenance models in the field, including Google’s Open Provenance Model Vocabulary; DERI’s X-Prov and W3P vocabularies; the open-source SWAN Provenance, Authoring and Versioning Ontology and Provenance Vocabulary; Inference Web’s Proof Markup Language-2 Ontology; the W3C’s now outdated RDF Datastore Schema; among others. As a practical matter, the PROV Ontology is already the underlying model for the bio-informatics industry as implemented at Oxford University, a prominent thought-leader in the RDF community.
At the core of the PROV Ontology is a conceptual data model with semantics instantiated by serializations including RDF and XML, plus a notation aimed at human consumption. These serializations are used by implementations to interchange provenance data. To help developers and users create valid provenance, a set of constraints is defined, useful for the creation of provenance validators. Finally, to further support the interchange of provenance, additional definitions are provided for protocols to locate, access and connect multiple provenance descriptions and, most importantly, for how to interoperate with the two widely used Dublin Core metadata vocabularies.
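To make the serialization idea concrete, here is a minimal sketch that emits one provenance statement as RDF in Turtle syntax. It uses plain string formatting so it runs anywhere; a real implementation would use an RDF library, and the ex: resource names are invented for illustration (the prov: namespace URI is the W3C’s):

```python
# The prov: namespace URI is the W3C's published one; ex: names are invented.
PREFIXES = (
    "@prefix prov: <http://www.w3.org/ns/prov#> .\n"
    "@prefix ex:   <http://example.org/> .\n\n"
)

def derived_from(entity, source):
    """Emit a prov:Entity with a prov:wasDerivedFrom link, in Turtle."""
    return ("ex:%s a prov:Entity ;\n"
            "    prov:wasDerivedFrom ex:%s .\n" % (entity, source))

doc = PREFIXES + derived_from("report", "dataset")
print(doc)
```

The same two triples could equally be interchanged as RDF/XML or as the human-readable notation discussed elsewhere in this series; that interchangeability is the point of anchoring the model in a family of serializations.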
The PROV Ontology is also somewhat ambitious, despite the perils of over-specification. It aims to provide a model not just for discrete data-points and relations applicable to any managed resource, but also for describing in depth the processes relevant to its development as a concept. This is reasonable in many contexts, such as a scholarly article, to capture its bibliography, but it seems odd in the context of non-media resources such as Persons. For instance, it might be odd to think of a notation of one’s parents as within the scope of “provenance data”. The danger of over-specification is palpable in the face of grand claims that, for instance, scientific publications will be describable by the PROV Ontology to an extent that reveals “How new results were obtained: from assumptions to conclusions and everything in between” [W3C Working Group Presentation].
Recommendations. Enterprises and organizations should immediately adopt the RDF PROVenance Ontology in their semantic applications. At a minimum, this ontology should be deeply incorporated within the fundamentals of any enterprise-wide models now driving semantic applications, and it should be a point of priority among key decision-makers. Based upon my review and use in my clients’ applications, this ontology is surely of a quality and scope that will drive a great deal of web traffic in the not-too-distant future. A market for user-facing ‘trust’ tools based on this ontology should begin to appear soon, which can stimulate the evolution of one’s semantic applications.
As for timing, the best strategy is to internally incubate the present ontology, with plans to then fully adopt the second Candidate Recommendation. This gives the standardization process for this Recommendation a chance to achieve a better level of maturity and completeness.