Open Framework, Information Management Strategy & Collaborative Governance | Data & Social Methodology - MIKE2.0 Methodology

Archive for the ‘Semantic Web’ Category

by: John McClure
06 Mar 2014

Grover: A Business Syntax for Semantic English

Grover is a semantic annotation markup syntax based on the grammar of the English language. Grover is related to the Object Management Group’s Semantics of Business Vocabulary and Rules (SBVR), explained later. Grover syntax assigns roles to common parts of speech in the English language so that simple, structured English phrases can be used to name and relate information on the semantic web. The clearer the syntax, the more valuable and useful the semantic web becomes.

An important open-source tool for semantic databases is Semantic MediaWiki, which lets anyone create a personal “wikipedia” in which private topics are maintained for personal use. The Grover syntax is based on this semantic tool and the friendly wiki environment it delivers, though the approach below might also be amenable to other toolsets and environments.

Basic Approach. Within a Grover wiki, syntax roles are established for classes of English parts of speech.

  • Subject:noun(s) -- verb:article/verb:preposition -- Object:noun(s)

refines the standard Semantic Web pattern:

  • SubjectURL -- PredicateURL -- ObjectURL

while in a SemanticMediaWiki environment, with its relative URLs, the pattern is:

  • (Subject) Namespace:pagename -- (Predicate) Property:pagename -- (Object) Namespace:pagename.


In a Grover wiki, topic types are nouns (more precisely, nounal expressions) that name concepts. Every concept is defined by a specific semantic database query, and these queries are the foundation of a controlled enterprise vocabulary. In Grover every pagename is the name of a topic, and every pagename includes a topic-type prefix. Example: Person:Barack Obama and Title:President of the United States of America, two topics related through one or more predicate relations, for instance “has:this”. Wikis are organized into ‘namespaces’: each page’s name is prefixed with a namespace-name, and namespace-names function equally as topic-type names. Additionally, an ‘interwiki prefix’ can indicate the URL of the wiki where a page is located, in a manner compatible with the Turtle RDF language.
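The prefixing convention can be made concrete with a small helper. A minimal sketch in Python (the function name and the split-on-first-colon rule are my own reading of the convention, not part of Grover):

```python
def parse_pagename(pagename):
    """Split a Grover-style pagename into (topic_type, topic_name).

    The topic-type (namespace) prefix is everything before the first
    colon; the remainder is the topic name, which may itself contain
    colons (e.g. "Property:has:this").
    """
    topic_type, _, name = pagename.partition(":")
    return topic_type, name
```

For example, `parse_pagename("Person:Barack Obama")` yields `("Person", "Barack Obama")`; an interwiki prefix, if present, would need one further partition.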

Nouns (nounal expressions) are the names of topic-types or of topics; in ontology-speak, nouns are class resources or individual resources, but nouns are rarely defined as property resources (and thereby used as a ‘predicate’ in the standard Semantic Web pattern mentioned above). This noun requirement is a systemic departure from today’s free-for-all, which allows nouns into the names of predicates, leading to ontologies that are problematic from the perspective of common users.

Verbs. In a Grover wiki, “property names” are an additional ontology component forming the bedrock of a controlled semantic vocabulary. Being pages in the “Property” namespace, these are prefixed with the namespace name, “Property”. However, an XML namespace is directly implied: for instance, has:this implies a “has” XML Namespace. The full pagename of this property is “Property:has:this”. The tenses of a verb (infinitive, past, present and future) are each an XML namespace, meaning there are separate have, has, had and will-have XML Namespaces. The modalities of a verb, may and must, are also separate XML Namespaces. Lastly, the negation forms of verbs (involving not) are additional XML Namespaces.

The “verb” XML Namespace name is only one part of a property name. The other part is either a preposition or a grammatical article. Together, these comprise an enterprise’s controlled semantic vocabulary.

As in English grammar, prepositions relate an indirect object, or the object of a preposition, to the subject of a sentence. Example: “John is at the Safeway” uses a property named “is:at” to yield the triple Person:John -- is:at -- Store:Safeway. There are roughly one hundred English prepositions available for any particular verbal XML Namespace. Examples: had:from, has:until and is:in.
As in English grammar, articles such as “a” and “the” relate direct objects or predicate nominatives to the subject of a sentence. As with prepositions, articles are associated with a verb XML Namespace. Examples: has:a, has:this, has:these, had:some, has:some and will-have:some.
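Because property names are composed systematically, a verb-form XML Namespace plus a preposition or article, the controlled vocabulary can be enumerated mechanically. A hedged sketch (the lists below are partial and the function is illustrative, not part of Grover):

```python
# Verb forms (tense, modality) each act as an XML Namespace, per the post.
VERB_FORMS = ["have", "has", "had", "will-have", "may-have", "must-have"]
# A few of the roughly one hundred English prepositions, plus articles.
PREPOSITIONS = ["from", "until", "in", "at", "of"]
ARTICLES = ["a", "this", "these", "some"]

def property_names(verb_forms, seconds):
    """Enumerate property names such as 'has:until' or 'will-have:some'."""
    return [f"{v}:{s}" for v in verb_forms for s in seconds]
```

`property_names(["has"], PREPOSITIONS)` gives `['has:from', 'has:until', 'has:in', 'has:at', 'has:of']`.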

Adjectives. In a Grover wiki, definitions in the “Category” namespace include adjectives, such as “Public” and “Secure”; these categories form a controlled modifier vocabulary. The Category namespace also includes definitions for past participles, such as “Secured” and “Privatized”. Every adjective and past participle is a category in which any topic can be placed. A third subclass of modifiers comprises ‘adverbs’, categories in which predicate instances are placed.

That’s about all that’s needed to understand Grover, the Business Syntax for Semantic English! Let’s use the Grover syntax to implement a snippet from the Object Management Group’s Semantics of Business Vocabulary and Rules (SBVR) which has statements such as this for “Adopted definition”:

adopted definition
Definition: definition that a speech community adopts from an external source by providing a reference to the definition.
Necessities: (1) The concept ‘adopted definition’ is included in Definition Origin. (2) Each adopted definition must be for a concept in the body of shared meanings of the semantic community of the speech community.


Now we can use Grover’s syntax to ‘adopt’ the OMG’s definition for “Adopted definition”.
Concept:Term:Adopted definition -- is:within -- Concept:Definition
Concept:Term:Adopted definition -- is:in -- Category:Adopted
Term:Adopted definition -- is:a -- Concept:Term:Adopted definition
Term:Adopted definition -- is:also -- Concept:Term:Adopted definition
Term:Adopted definition -- is:of -- Association:Object Management Group
Term:Adopted definition -- has:this -- Reference:
Term:Adopted definition -- must-be:of -- Concept:Semantic Speech Community
Term:Adopted definition -- must-have:some -- Concept:Reference
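Represented as plain subject-predicate-object tuples, the deontic statements (those whose predicates begin with “must-”) can be separated mechanically from the plain assertions. A sketch in Python (the tuple encoding is mine, not part of Grover):

```python
statements = [
    ("Concept:Term:Adopted definition", "is:within", "Concept:Definition"),
    ("Concept:Term:Adopted definition", "is:in", "Category:Adopted"),
    ("Term:Adopted definition", "is:a", "Concept:Term:Adopted definition"),
    ("Term:Adopted definition", "is:also", "Concept:Term:Adopted definition"),
    ("Term:Adopted definition", "is:of", "Association:Object Management Group"),
    ("Term:Adopted definition", "has:this", "Reference:"),
    ("Term:Adopted definition", "must-be:of", "Concept:Semantic Speech Community"),
    ("Term:Adopted definition", "must-have:some", "Concept:Reference"),
]

# Deontic predicates (must-...) state constraints; the rest assert facts.
constraints = [t for t in statements if t[1].startswith("must-")]
assertions = [t for t in statements if not t[1].startswith("must-")]
```

Here the two “must-” statements act as data-model constraints on the other six.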

This simplified but structured English permits the widest possible segment of the populace to participate in constructing and perfecting an enterprise knowledge base built upon the Resource Description Framework.

More complex information can be specified on wikipages using standard wiki templates. For instance, to show multiple references on the “Term:Adopted definition” page, the “has:this” wiki template can be used. Multi-lingual text values and resource references would be written as follows, using the wiki templates (a) {{has:this}} and (b) {{skos:prefLabel}}:
{{has:this |@=en|@en=Reference:}}
{{skos:prefLabel|@=en;de|@en=Adopted definition|@de=Angenommen definition}}

One important feature of the Grover approach is its modification of our general understanding of how ontologies are built. Today, ontologies specify classes, properties and individuals; a data model emerges from the range/domain axioms associated with each property’s definition. Under Grover, instead, an ontology’s data models are explicitly stated with deontic verbs that pair subjects with objects; this is an intuitively stronger and more governable approach for such a critical enterprise resource as the ontology.

Category: Business Intelligence, Enterprise Content Management, Enterprise Data Management, Enterprise2.0, Information Development, Semantic Web
No Comments »

by: John McClure
23 Dec 2013

Semantic Notations

In a previous entry about the Worldwide Web Consortium (W3C)’s RDF PROVenance Ontology, I mentioned that it includes a notation aimed at human consumption — wow, I thought, that’s a completely new architectural thrust by the W3C. Until now the W3C has published only notations aimed at computer consumption. Now it is going to be promoting a “notation aimed at human consumption”! Here are examples of what’s being proposed.
[1] entity(e1)
[2] activity(a2, 2011-11-16T16:00:00, 2011-11-16T16:00:01)
[3] wasDerivedFrom(e2, e1, a, g2, u1)

[1] declares that an entity named “e1” exists. This could presumably have been “book(e1)”, so any subclass of prov:Entity can be referenced instead. Note: the prov:entity property and the prov:Entity class are top concepts in the PROV ontology.
[2] should be read as “activity a2, which occurred between 2011-11-16T16:00:00 and 2011-11-16T16:00:01”. An ‘activity’ is a sub-property of prov:influence, as an ‘Activity’ is a sub-class of prov:Influence, both top concepts in the PROV ontology.
[3] shows additional details for a “wasDerivedFrom” event (a sub-property of prov:wasInfluencedBy, a top concept of the PROV ontology); to wit, the activity a, the generation g2, and the usage u1 are each related to the “wasDerivedFrom” relation.

The W3C syntax above is a giant step towards establishing standard mechanisms for semantic notations. I’m sure, though, that this approach doesn’t yet qualify as an efficient user-level syntax for semantic notation. First, the vocabulary of properties essentially mirrors the vocabulary of classes; from a KISS view this ontology design pattern imposes a useless burden on users, who must learn two vocabularies of similar, hierarchical names, one for classes and one for properties. Second, camel-case conventions are surely not amenable to most users (really, who wants to promote poor spelling?). Third, attribute values are not labelled classically (“type:name”); treating resource names opaquely wastes an obvious opportunity for clarification and for discovery of subjects’ own patterns of information, as well as incidental annotations not made elsewhere. Finally, a small point: commas are not infrequently found in names of resources, causing problems in this context.

Another approach is to rearrange the strings in the notations above to achieve a more natural reading for average English speakers. Using a consistent framework of verbs and prepositions for properties named verb:preposition, an approach introduced in an earlier entry, yields an intuitively more interesting syntax with possibilities for future expansion.
[1] has:this(Entity:e1)
[2] has:this(Activity:a2; Timestamp:2011-11-16T16:00:00; Timestamp:2011-11-16T16:00:01)
[3] was:from(Source:e2; Event:e1; Activity:a; Generation:g2; Usage:u1)
[4] was:from(Source:e1; Source:e2; Source:e3)

[1] declares that an annotation for a specific page (a subject) has a certain entity named e1, which may or may not exist (that is, be de-referenceable). e1 is qualified as of type “Entity”.
[2] By convention the first value in a set of values provided to a “property function” is the target of the namespaced relation “has:this” with the subject resource being the resource which is being semantically annotated. Each attribute value associated with the relation is qualified by the type of value that it is.
[3] The property wasDerivedFrom is here a relation with a target resource that is to be interpreted as a “Source” kind-of-thing, i.e., “role”. This relation shows four related (perhaps influential) resources.
[4] This is a list of attribute values acceptable in this notation, overloading the ‘was:from’ predicate function for a less tiresome syntax.
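One attraction of this function-style notation is that it parses with a few lines of code. A sketch (the parsing rules below are my reading of the examples above, not a published grammar):

```python
import re

def parse_statement(stmt):
    """Parse 'predicate(Type:name; Type:name; ...)' into
    (predicate, [(type, name), ...])."""
    m = re.fullmatch(r"([\w:-]+)\((.*)\)", stmt.strip())
    predicate, body = m.group(1), m.group(2)
    # Each argument is "Type:value"; split only on the first colon so
    # typed values like timestamps keep their internal colons.
    args = [tuple(a.strip().split(":", 1)) for a in body.split(";")]
    return predicate, args
```

Parsing `was:from(Source:e2; Event:e1; Activity:a; Generation:g2; Usage:u1)` yields the predicate `was:from` and five typed arguments, the first being `("Source", "e2")`.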

The chief advantages of this approach over the current W3C recommendation are, first, that it eliminates the need for dual (dueling) hierarchies by adopting a fixed number of prepositional properties. Second, it is formally more complete in a lexical sense, yet ripe with opportunities for improved semantic markup, in part through its requirement to type strings. Lastly, it is intuitively clearer to an average user, perhaps leading to a more conducive process for semantic annotations.

Category: Data Quality, Enterprise Content Management, Enterprise Data Management, Information Development, Information Governance, Information Management, Information Strategy, Semantic Web, Sustainable Development
No Comments »

by: John McClure
21 Dec 2013

The RDF PROVenance Ontology

“At the toolbar (menu, whatever) associated with a document there is a button marked ‘Oh, yeah?’. You press it when you lose that feeling of trust. It says to the Web, ‘so how do I know I can trust this information?’. The software then goes directly or indirectly back to metainformation about the document, which suggests a number of reasons.” – Tim Berners-Lee, 1997

“The problem is – and this is true of books and every other medium – we don’t know whether the information we find [on the Web] is accurate or not. We don’t necessarily know what its provenance is.” – Vint Cerf

The Worldwide Web Consortium (W3C) hit another home run last May when the RDF PROVenance Ontology officially became a member of the Resource Description Framework family. This timely publication proposes a data model well-suited to its task: representing provenance metadata about any resource. Provenance data for a thing relates directly to its chain of ownership, its development or treatment as a managed resource, and its intended uses and audiences. Provenance data is a central requirement for any trust-ranking process, which often occurs against digital resources sourced from outside an organization.

The PROV Ontology is bound to have important impacts on existing provenance models in the field, including Google’s Open Provenance Model Vocabulary; DERI’s X-Prov and W3P vocabularies; the open-source SWAN Provenance, Authoring and Versioning Ontology and Provenance Vocabulary; Inference Web’s Proof Markup Language-2 Ontology; the W3C’s now outdated RDF Datastore Schema; among others. As a practical matter, the PROV Ontology is already the underlying model for the bio-informatics industry as implemented at Oxford University, a prominent thought-leader in the RDF community.

At the core of the PROV Ontology is a conceptual data model with semantics instantiated by serializations including RDF and XML, plus a notation aimed at human consumption. These serializations are used by implementations to interchange provenance data. To help developers and users create valid provenance, a set of constraints is defined, useful for the creation of provenance validators. Finally, to further support the interchange of provenance, additional definitions are provided: protocols to locate, access, and connect multiple provenance descriptions and, most importantly, guidance on interoperating with the two widely used Dublin Core metadata vocabularies.

The PROV Ontology is also slightly ambitious, despite the perils of over-specification. It aims to provide a model not just for discrete data-points and relations applicable to any managed resource, but also for describing in depth the processes relevant to a resource’s development as a concept. This is reasonable in many contexts (a scholarly article, say, to capture its bibliography) but it seems odd for non-media resources such as Persons. For instance, it might be odd to think of a notation of one’s parents as within the scope of “provenance data”. The danger of over-specification is palpable in the face of grand claims that, for instance, scientific publications will be describable by the PROV Ontology to an extent that reveals “How new results were obtained: from assumptions to conclusions and everything in between” [W3 Working Group Presentation].

Recommendations. Enterprises and organizations should immediately adopt the RDF PROVenance Ontology in their semantic applications. At a minimum this ontology should be deeply incorporated within the fundamentals of any enterprise-wide models now driving semantic applications, and it should be a point of priority among key decision-makers. Based upon my review and its use in my clients’ applications, this ontology is surely of a quality and scope that will drive a massive amount of web traffic in the not-distant future. A market for user-facing ‘trust’ tools based on this ontology should begin to appear soon, which can stimulate the evolution of one’s semantic applications.

As for timing, the best strategy is to incubate the present ontology internally, with plans to fully adopt the second Candidate Recommendation when it arrives. This gives the standardization process for this Recommendation a chance to achieve a better level of maturity and completeness.

Category: Data Quality, Enterprise Data Management, Information Development, Information Governance, Information Management, Information Strategy, Information Value, Master Data Management, Metadata, Semantic Web, Web Content Management
No Comments »

by: John McClure
11 Nov 2013

Semantic Business Vocabularies and Rules

For many in the traditional applications development community, “semantics” sounds like a perfect candidate for a buzzword tossed at management in an effort to pry fresh funding for what may appear to be academic projects with little discernible practical payback. Indeed, when challenged for examples of “semantic applications”, one often hears stumbling litanies about “Linked Open Data”, ubiquitous “URLs” and “statement triples”. Traditional database folks might then retort “where’s the beef?”: URLs in web applications are certainly just as ubiquitous; they are stored in database columns which are named just as “semantic properties” are; those columns sit in rows whose foreign keys are as construable as the “subjects” of a standard semantic statement; and that’s still not to mention the many other SQL tools in which enterprises have heavily invested over many years! So…

It’s a good question – where IS the beef?

The Object Management Group (OMG), a highly respected standards organization with outsized impacts on many enterprises, has recently released a revision of its Semantics of Business Vocabulary and Rules (SBVR) that provides one clear answer to this important question.

Before we go there, let’s stipulate that for relatively straightforward (but nonetheless closed-world) applications, it’s probably not worth the time or expense to ‘semanticize’ the application and its associated database. These are applications that have few database tables; that change at a low rate; that have fairly simple forms-based user interfaces, if any; and that are often connected through transaction queues with an enterprise’s complex of applications. For these, fine, exclude them from projects migrating an enterprise to a semantic-processing orientation.

Another genre of applications are those called ‘mission critical’. These are characterized by a large number of database tables and consequently a large number of columns and datatypes the application must juggle. These applications have moderate to high rates of change, accommodating the shifting (and newly added) functional requirements of dynamic enterprises; this is not so much mission creep as the normal response to the tempo of the competitive environments in which enterprises exist.

The fact is that the physical schema for a triples-based (or quad-based) semantic database does not change; the physical schema is static. Rather, it’s the ontology, the logical database schema, that changes to meet new requirements of the enterprise. This is an important outcome of a re-engineered applications development process: it eliminates the often costly tasks associated with designing, configuring and deploying physical schemas.

Traditionalists might view this shift as mere cleverness, one equally accomplished by tools which transform logical database designs into physical database schema. Personally I don’t have the background to debate the effectiveness of these tools. However, let’s take a larger view, one suggested by the OMG specification for Business Vocabularies and Rules.

Business Policies – where it begins, and will end

Classically, business management establishes policies which are sent to an Information Technology department for incorporation into new and existing applications. It is then the job of systems analysts to stare at these goats and translate them into coding specifications for development and testing. Agile and other methodologies help speed this process within the IT department; however, until the fundamental dynamic between management and IT changes, this cycle remains slow, costly and mistake-prone.

Now this is where OMG’s SBVR applies: it is an ontology for capturing rules such as “If A, then thou shalt not do X when Y or Z applies; otherwise thou shalt do B and C” in a machine-processable form (that is, as semantic statements). Initially, suitably trained systems analysts draft these statements, as well as pertinent queries to be performed by applications at the proper moment. At some point, however, tools will appear that permit management themselves to draft new and changed policies and test their impact against live databases.
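To make the idea concrete, here is a toy sketch of such a rule captured as data and evaluated against a set of facts; the representation is entirely my own and far simpler than SBVR’s:

```python
def evaluate(rule, facts):
    """Evaluate a toy deontic rule: if the condition holds and any
    guard applies, the listed actions are prohibited; if the condition
    holds and no guard applies, the other actions are obliged."""
    prohibited, obliged = set(), set()
    if rule["condition"] in facts:
        if any(g in facts for g in rule["guards"]):
            prohibited |= set(rule["forbid"])
        else:
            obliged |= set(rule["oblige"])
    return prohibited, obliged

# "If A, then thou shalt not do X when Y or Z applies;
#  otherwise thou shalt do B and C."
rule = {"condition": "A", "guards": ["Y", "Z"],
        "forbid": ["X"], "oblige": ["B", "C"]}
```

Because the rule is data rather than code, changing a policy means editing statements, not redeploying an application.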

This is real business process re-engineering at its brightest. Policy implementation and operational costs both fall when the same language (a somewhat ‘structured English’) is used to state what should and must be as is used to state what is. Without that common language, enterprises can only rely on the skills of systems analysts to adequately communicate business and regulatory requirements to others.

Capturing and weaving unstructured lexical information into enterprise applications has never been possible with traditional databases. This is why ‘semantics’ is such a big deal.


Category: Enterprise Content Management, Information Development, Information Management, Information Strategy, Information Value, Master Data Management, Semantic Web
No Comments »

by: John McClure
07 Nov 2013

Wikidata shows the way

Studies consistently rank DBpedia as a crucial repository in the semantic web; its data is extracted from Wikipedia and then structured according to DBpedia’s own ontology. Available under Creative Commons and GNU licenses, the repository can be queried directly on the DBpedia site and it can be downloaded by the public for use within other semantic tool environments.

This is a truly AMAZING resource! The English version of the DBpedia knowledge base, for instance, now has over two billion ‘triples’ describing 4,000,000+ topics — 20% are persons, 16% places, 5% organizations including 95,000 companies and educational institutions, plus creative works, species, diseases, and so on — with equally impressive statistics for the knowledge bases in over one hundred other languages. And the DBpedia Ontology itself has over 3,000,000 classes, properties and instances. What a breathtaking undertaking in the public sphere!

Recently I had a wonderful opportunity to hear about DBpedia’s latest projects for its repository; here are the slides. DBpedia is now surely moving towards adoption of an important tool, Wikidata, in order to aggregate its 120 language-specific databases into one single, multi-lingual repository.

Wikidata‘s own project requirements are interesting to the MIKE2 community as they parallel significant challenges common to most enterprises in areas of data provenance and data governance. Perhaps in response to various public criticisms about the contents of Wikipedia, Wikidata repositories support source citations for every “fact” the repository contains.

The Wikidata perspective is that it is a repository of “claims” as distinguished from “facts”. Say, for example, that an estimate of a country’s Gross National Product is recorded. This estimate is a claim that will often change over time, and will often be confronted by counter-claims from different sources. What Wikidata does is provide a data model that keeps track of all values claimed about something, with the expectation that this kind of detailed information will lead to mechanisms directly relevant to the level of “trust” that may confidently be associated with any particular ‘statement of fact’.
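A minimal sketch of such a claim model in Python (the field names and example figures are illustrative, not Wikidata’s actual data model):

```python
# Each recorded value keeps its source citation, so competing estimates
# coexist rather than one overwriting another.
claims = []

def add_claim(subject, prop, value, source):
    claims.append({"subject": subject, "property": prop,
                   "value": value, "source": source})

def claims_for(subject, prop):
    """All claimed values (with sources) for one subject/property pair."""
    return [c for c in claims
            if c["subject"] == subject and c["property"] == prop]

# Hypothetical competing GNP estimates for an example country:
add_claim("Country:Example", "GNP", 1.00e12, "Agency A, 2012 estimate")
add_claim("Country:Example", "GNP", 1.05e12, "Agency B, 2013 estimate")
```

A trust-ranking mechanism can then weigh the competing claims by their sources instead of silently picking one “fact”.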

The importance of source citations is not restricted to the credibility of Wikipedia itself and its derivative repositories; rather this is a universal requirement common to all enterprises whether they be semantics-oriented or not. A simple proposition born of science — to distinguish one’s original creative and derived works from those learned from others — is now codified and freighted with intellectual property laws (copyrights, patents), subjects of complex international trade treaties.

Equally faced by most enterprises is a workforce located around the globe, each with varying strengths in the English language. By using a semantic repository deeply respectful of multilingual requirements — such as Wikidata is — enterprises can deploy ontologies and applications that improve worker productivity across-the-board, regardless of language.

Wikidata is a project funded by Wikipedia-Germany. Enterprises might consider helping to fund open-source projects of this nature, as these are most certainly investments whose value cannot be over-estimated.

Visit here for more Semantic MediaWiki conferences and slides. Ciao!

Category: Information Development, Information Governance, Information Management, Information Strategy, Information Value, Open Source, Semantic Web
No Comments »

by: John McClure
07 Sep 2013

Data Cubes and LOD Exchanges

Recently the W3C issued a Candidate Recommendation that caught my eye, about the Data Cube Vocabulary, which claims to be both very general and useful for data sets such as survey data, spreadsheets and OLAP. The Vocabulary is based on SDMX, an ISO standard for exchanging and sharing statistical data and metadata among organizations, so there’s a strong international impetus towards consensus about this important piece of Internet 3.0.

Linked Open Data (LOD) is, as the W3C says, “an approach to publishing data on the web, enabling datasets to be linked together through references to common concepts.” This approach is based on the W3C’s Resource Description Framework (RDF), of course. So the foundational ontology that actually implements this worthwhile but grandiose LOD vision is undoubtedly to be this Data Cube Vocabulary, very handily enabling the exchange of semantically annotated HTML tables. Note that relational government data tables can now be published as “LOD data cubes”, shareable with a public adhering to this new world-standard ontology.

But as the key logical layer in this fast-coming semantic web, Data Cubes very well may affect the manner in which an Enterprise ontology might be designed. Start with the fact that Data Cubes are themselves built upon several more basic RDF-compliant ontologies.

The Data Cube Vocabulary says that every Data Cube is a Dataset that has a Dataset Definition (more a “one-off ontology” specification). Any dataset can have many Slices of a metaphorical, multi-dimensional “pie” of the dataset. Within the dataset itself, and within each slice, are unbounded masses of Observations; each observation has values not only for the measured property itself but also for any number of applicable key values. That’s all there is to it, right?

Think of an HTML table. A “data cube” is a “table” element, whose columns are “slices” and whose rows are “keys”. “Observations” are the values of the data cells. This is clear but now the fun starts with identifying the TEXTUAL VALUES that are within the data cells of a given table.
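That table analogy can be written down directly: each cell becomes an Observation carrying its row key and its column (the measured property). A sketch under my own naming, not the vocabulary’s URIs:

```python
def table_to_observations(row_keys, column_names, cells):
    """Flatten a table into Data-Cube-style observations: one record
    per cell, tagged with its dimension key (row) and measure (column)."""
    observations = []
    for key, row in zip(row_keys, cells):
        for column, value in zip(column_names, row):
            observations.append({"dimension": key,
                                 "measure": column,
                                 "value": value})
    return observations
```

A two-row, two-column table thus yields four observations, each self-describing enough to be published as a triple.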

Here is where “Dataset Descriptions” come in: these are associated with an HTML table element in an LOD dataset. They describe all possible dataset dimension keys and the different kinds of properties that can be named in an Observation. Text attributes, measures, dimensions, and coded properties are all provided, and all are sub-properties of rdf:Property.

This is why a Dataset Description is a “one-off ontology”, because it defines only text and numeric properties and, importantly, no classes of functional things. So with perfect pitch, the Data Cube Vocabulary virtually requires Enterprise Ontologies to ground their property hierarchy with the measure, dimension, code, and text attribute properties.

Data Cubes define just a handful of classes: “Dataset”, “Slice”, “SliceKey” and “Observation”. How are these four classes best inserted into an enterprise’s ontology class hierarchy? “Observation” is easy: it should be the base class of all observable properties, that is, all and only textual properties. “SliceKey” is a Role that an observable property can play. A “Slice” is basically an annotated rdf:Bag, mediated by skos:Collection at times.

A “Dataset” is a hazy term applicable to anything classifiable as a data object or data structure; that is, “a set of data” is merely an aggregate collection of data items, just as a data object or data structure is. Accordingly, a Data Cube “dataset” class might be placed at or near the root of a class hierarchy, but it’s clearer to establish it as a subclass of an html:table class.

There’s more to this topic saved for future entries — all those claimed dependencies need to be examined.

Category: Business Intelligence, Information Development, Semantic Web
No Comments »

by: John McClure
01 Aug 2013

Making Microsense of Semantics

Like many, I’m one who’s been around since the cinder-block days, once entranced by shiny Tektronix tubes stationed near a dusty card sorter. After years using languages as varied as Assembler through Scheme, I’ve come to believe the shift these represented, from procedural to declarative, has much improved the flexibility of the software organizations produce.

Interest has now moved towards an equally flexible representation of data. In the ‘old’ days, when an organization wanted to collect a new data-item about, say, a Person, a new column would first be added by a friendly database administrator to a Person table in one’s relational database. Very inflexible.

The alternative — now widely adopted — reduces databases to a simple formulation, one that eliminates Person and other entity-specific tables altogether. These “triple-stores” basically have just three columns — Subject, Predicate and Object — in which all data is stored. Triple-stores are often called ‘self-referential’ because, first, the type of the Subject of any row in a triple-store is found in a different row (not column) of the triple-store and, second, definitions of types are themselves found in further rows of the triple-store. The benefits? Not only is the underlying structure of a triple-store unchanging, but stand-alone metadata tables (tables describing tables) become unnecessary.
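A toy triple-store makes the point: one unchanging three-column structure holds both the instance data and the schema statements that describe it. (The class and predicate names below are illustrative only.)

```python
class TripleStore:
    """A self-referential store: schema and data share one table."""

    def __init__(self):
        self.rows = []  # the single, never-changing (s, p, o) "table"

    def add(self, s, p, o):
        self.rows.append((s, p, o))

    def query(self, s=None, p=None, o=None):
        """Match rows against a pattern; None acts as a wildcard."""
        return [t for t in self.rows
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

store = TripleStore()
store.add("Person", "isA", "Class")     # the type's definition is a row
store.add("alice", "isA", "Person")     # typing is just another row
store.add("alice", "worksFor", "Acme")  # a new "column" needs no DDL
```

Note that adding the "worksFor" relation required no schema change at all, which is exactly the flexibility the post describes.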

Why make the change? Static relational database tables do work well enough to handle transactional records whose data-items are usually well-known in advance; the rate of change in those business processes is fairly low, so the cost of database architectures based on SQL tables is equally low. What, then, is driving the adoption of triple-stores?

The scope of business functions organizations seek to automate has enlarged considerably: the source of new information within an organization is less frequently “forms” completed by users, and more frequently raw text from documents, tweets, blogs, emails, newsfeeds, and other ‘social’ web and internal sources, produced, received or retrieved by organizations.

Semantic technologies are essential components of Natural Language Processing (NLP) applications which extract and convert, for instance, all proper nouns within a text into harvestable networks of “information nodes” found in a triple-store. In fact during such harvesting, context becomes a crucial variable that can change with each sentence analyzed from the text.

This brings us to my primary distinction between truly semantic and non-semantic applications: truly semantic applications mimic human conversation, in which an individual’s knowledge is the result of a continuous accrual of context-specific facts, context-specific definitions, even context-specific contexts. Wittgenstein, a giant of modern philosophy, called this phenomenon a Language Game, to connote that one’s techniques and strategies for analyzing a game’s state, and one’s actions, cannot be derived in advance; they emerge only during the play of the game, i.e., during processing of the text corpora.

Non-semantic applications, on the other hand, are more like rites, in which all operative dialogs are pre-written, memorized, and repeated endlessly.

This analogy to human conversation (to ‘dynamic semantics’) is hardly trivial; it is a dominant modelling technique among ontologists, as evidenced by the development of, for instance, Discourse Representation Theory (among others; legal communities have a similar theory, called simply Argumentation), whose rules are used to build Discourse Representation Structures from a stream of sentences, accommodating a variety of linguistic issues including plurals, tense, aspect, generalized quantifiers, and anaphora.

“Semantic models” are an important path towards a more complete understanding of how humans, armed with language, reason and draw conclusions about the world. Relational tables, by themselves, have provided no similar insight, nor the same capacity for re-purposing in different contexts. This alone is strong evidence that semantic methods and tools should figure prominently in any organization’s technology plans.

Category: Business Intelligence, Information Development, Information Strategy, Semantic Web
1 Comment »

by: Bsomich
22  Nov  2012

Guiding Principles for the Open Semantic Enterprise

We’ve just released the seventh episode of our Open MIKE Podcast series!

Episode 07: “Guiding Principles for the Open Semantic Enterprise” features key aspects of the following MIKE2.0 solution offerings:

Semantic Enterprise Guiding Principles:

Semantic Enterprise Composite Offering:

Semantic Enterprise Wiki Category:

Check it out:

Open MIKE Podcast – Episode 07 from Jim Harris on Vimeo.


Want to get involved? Step up to the “MIKE”

We kindly invite any existing MIKE contributors to contact us if they’d like to contribute any audio or video segments for future episodes.

On Twitter? Contribute and follow the discussion via the #MIKEPodcast hashtag.

Category: Open Source, Semantic Web
1 Comment »

by: Phil Simon
24  Jun  2012

The Semantic Web Inches Closer

I’ve written before on this site about the vast implications of the forthcoming semantic web. In short, it will be a game-changer, but it certainly won’t happen anytime soon. Every day, though, I hear about organizations taking one more step in that direction. Case in point: A few days ago, Harvard announced that it was “making public the information on more than 12 million books, videos, audio recordings, images, manuscripts, maps, and more things inside its 73 libraries.” From the piece:

Harvard can’t put the actual content of much of this material online, owing to intellectual property laws, but this so-called metadata of things like titles, publication or recording dates, book sizes or descriptions of what is in videos is also considered highly valuable. Frequently descriptors of things like audio recordings are more valuable for search engines than the material itself. Search engines frequently rely on metadata over content, particularly when it cannot easily be scanned and understood.

This might not seem like a terribly big deal to the average person. Five years ago, I wouldn’t have given this announcement much thought. But think for a moment about the ramifications of such a move. After all, Harvard is a prominent institution and others will no doubt follow its lead here. More metadata from schools, publishers, record companies, music labels, and businesses mean that the web will become smarter–much smarter. Search will continue to evolve in ways that relatively few of us appreciate or think about.

Understanding Why

And let’s not forget about data mining and business intelligence. Forget about knowing more about who buys which books, although this is of enormous importance. (Ask Jeff Bezos.) Think about knowing why these books or CDs or movies sell, or, perhaps more important, why they don’t. Consider the following questions and answers:

  • Are historical novels too long for the “average” reader? We’ll come closer to knowing because metadata includes page and word counts.
  • Which book designs result in more conversions? Are there specific fonts that readers find more appealing than others?
  • Are certain keywords registering more with a niche group of readers? We’ll know because tools will allow us to perform content and sentiment analysis.
  • Which authors’ books resonate with which readers? Executives at companies like Amazon and Apple must be frothing at the mouth here.
  • Which customers considered buying a book but ultimately did not? Why did they opt not to click the buy button?

I could go on but you get my drift. Metadata and the semantic web collectively mean that no longer will we have to look at a single book sale as a discrete event. We’ll be able to know so much more about who buys what and why. Ditto cars, MP3s, jelly beans, DVDs, and just about any other product out there.

Simon Says

In the next ten years, we still may not be able to answer every commerce-related question, or any question in its entirety. However, a more semantic web means that a significant portion of the mystery behind the purchase will be revealed. Every day, we get a little closer to a better, more semantic web.


What say you?

Tags: ,
Category: Metadata, Semantic Web
1 Comment »

by: Phil Simon
05  Jul  2011

Metadata and Collaborative Filtering


In last week’s post, I discussed how Apple doesn’t sweat perfection. At a very high level, the company gets the big stuff right. In this post, I’d like to extend that concept to unstructured data and Collaborative Filtering (CF), defined by Wikipedia as:

the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc. Applications of collaborative filtering typically involve very large data sets. Collaborative filtering methods have been applied to many different kinds of data including sensing and monitoring data – such as in mineral exploration, environmental sensing over large areas or multiple sensors; financial data – such as financial service institutions that integrate many financial sources; or in electronic commerce and web 2.0 applications where the focus is on user data, etc.

For instance, let’s say that I am a fan of Rush, the criminally underrated Canadian power trio. In iTunes, I rate their songs, along with those from other bands. Because I enjoy Rush’s music, I am likely to like Genesis, Pink Floyd, Yes, and other 70s prog rock bands. (I do.) I rate many songs by those bands high as well, helping Apple learn about my listening habits.

Now, multiply that by millions of people. While no two people may have precisely the same taste in music (or books, apps, movies, TV shows, or art, for that matter), the technology and data collectively allow Apple to learn a great deal about group listening habits. As a result, its technology will enable relevant recommendations to me.
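A toy version of user-based collaborative filtering shows the mechanics. The ratings below are made-up illustrative data (echoing the bands named above), and the cosine-similarity weighting is just one common approach, not necessarily what Apple uses:

```python
from math import sqrt

# Hypothetical band ratings (1-5) for three users.
ratings = {
    "phil":  {"Rush": 5, "Yes": 4, "Genesis": 4, "Pink Floyd": 5},
    "alice": {"Rush": 5, "Yes": 5, "Genesis": 4, "Kansas": 5, "Beyonce": 2},
    "bob":   {"Beyonce": 5, "Rush": 1, "Pink Floyd": 2},
}

def similarity(a, b):
    """Cosine similarity over the bands both users rated."""
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return 0.0
    num = sum(ratings[a][x] * ratings[b][x] for x in shared)
    den = (sqrt(sum(ratings[a][x] ** 2 for x in shared))
           * sqrt(sum(ratings[b][x] ** 2 for x in shared)))
    return num / den

def recommend(user):
    """Rank each unrated band by a similarity-weighted average of
    other users' ratings for it."""
    num, den = {}, {}
    for other in ratings:
        if other == user:
            continue
        w = similarity(user, other)
        for band, r in ratings[other].items():
            if band not in ratings[user]:
                num[band] = num.get(band, 0.0) + w * r
                den[band] = den.get(band, 0.0) + w
    return sorted(num, key=lambda b: num[b] / den[b], reverse=True)

print(recommend("phil"))
```

Because alice’s tastes overlap phil’s far more than bob’s do, her high rating for Kansas outweighs bob’s enthusiasm for Beyoncé, which is exactly the group-listening effect described above.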

The Law of Large Numbers

But let’s say that some people out there have odd tastes. They like Rush but they’re also huge Beyoncé fans. (There’s nothing wrong with her or her music, it’s just that most people don’t like both her and Rush.) They give high marks to the latest Beyoncé album, as well as Rush’s most recent release. Won’t this throw off Apple’s rating systems?

Or what about those who hate Rush? Sadly, many people not only don’t share my passion for the band, but actively despise their cerebral approach to lyrics and odd time signatures. And, yes, many people hate the band because of Geddy Lee’s voice. What if they intentionally rate Pink Floyd songs high but Rush songs low? What if they rate Rush songs high and intentionally suggest songs in a completely unrelated genre such as Gangsta Rap? Won’t this make Apple’s recommendations less relevant?

In a word, no. Apple’s recommendation technology takes advantage of the law of large numbers, which states, in a nutshell, that large sample sizes can withstand a few inaccurate entries, selections, and flat-out errors.
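A quick simulation, with made-up numbers, illustrates why a handful of hostile ratings washes out at scale:

```python
import random

random.seed(42)  # deterministic run for illustration

def average_rating(n_listeners, n_trolls):
    """Average rating when honest listeners rate a song 4 or 5
    and a fixed handful of 'trolls' all rate it 1."""
    honest = [random.choice([4, 5]) for _ in range(n_listeners)]
    trolls = [1] * n_trolls
    votes = honest + trolls
    return sum(votes) / len(votes)

# With only 20 honest raters, 5 trolls drag the average well below the
# honest consensus of about 4.5; with 100,000 raters, the same 5 trolls
# barely register.
print(round(average_rating(20, 5), 2))
print(round(average_rating(100_000, 5), 2))
```

The trolls’ influence shrinks in proportion to sample size, which is the law of large numbers doing the work Apple relies on.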

And Apple isn’t alone. Google, Facebook, Netflix, Pandora, and a host of other companies utilize this law. Coupled with accurate metadata, these companies are able to make remarkably accurate suggestions and learn a great deal about their users and customers. For more on the concept of metadata, click here.

Simon Says

Accuracy and data quality are still very important, particularly in structured data sets. One errant entry in a table can cause many problems, something that I have seen countless times. With unstructured data, however, the bar is much lower. Embrace large datasets, for they are much better able to withstand corrupt or inaccurate information.


What say you?


Category: Information Development, Semantic Web
No Comments »
