Posts Tagged ‘metadata management’
In a previous post on this blog, I reviewed what movie ratings teach us about data quality. In this post I ponder another movie-related metaphor for information development by looking at what movie descriptions teach us about metadata.
Nightmare on Movie Night
It’s movie night. What are you in the mood for? Action Adventure? Romantic Comedy? Science Fiction? Even after you settle on a genre, picking a movie within it can feel like a scene from a Suspense Thriller. Even if you are in the mood for a Horror film, you don’t want to turn movie night into a nightmare by watching a horrible movie. You need reliable information about movies to help you make your decision. You need better movie metadata.
Tag the Movie: A Netflix Original
In his article How Netflix Reverse Engineered Hollywood, Alexis Madrigal explained how Netflix uses large teams of people specially trained to watch movies and tag them with all kinds of metadata.
Madrigal described the process as “so sophisticated and precise that taggers receive a 36-page training document that teaches them how to rate movies on their sexually suggestive content, goriness, romance levels, and even narrative elements like plot conclusiveness. They capture dozens of different movie attributes. They even rate the moral status of characters. When these tags are combined with millions of users’ viewing habits, they become Netflix’s competitive advantage. The company’s main goal as a business is to gain and retain subscribers. And the genres that it displays to people are a key part of that strategy.”
The Vocabulary and Grammar of Movies
As Madrigal investigated how Netflix describes movies, he discovered a well-defined vocabulary. Standardized adjectives were consistently used (e.g., Romantic, Critically Acclaimed, Dark, Suspenseful). Phrases beginning with “Based on” revealed where the idea for a movie came from (e.g., Real Life, Children’s Book, Classic Literature). Phrases beginning with “Set in” oriented where or when a movie was set (e.g., Europe, Victorian Era, Middle East, Biblical Times). Phrases beginning with “From the” dated the decade the movie was made in. Phrases beginning with “For” recommended appropriate age ranges for children’s movies. And phrases beginning with “About” provided insight about the thematic elements of a movie (e.g., Food, Friendship, Marriage, Parenthood).
Madrigal also discovered the rules of grammar Netflix uses to piece together the components of its movie vocabulary to generate more comprehensive genres in a consistent format, such as Romantic Dramas Based on Classic Literature Set in Europe From the 1940s About Marriage (one example of which is Pride and Prejudice).
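Netflix’s internal tooling is proprietary, but the grammar Madrigal describes is easy to picture as code: fixed slots, each filled from a controlled vocabulary, concatenated in a set order. The following is a minimal sketch; the function name, slot names, and slot ordering are assumptions for illustration, chosen to reproduce the example above.

```python
def build_genre(adjective=None, noun=None, based_on=None, set_in=None,
                from_decade=None, about=None):
    """Assemble a genre phrase from optional controlled-vocabulary slots."""
    parts = []
    if adjective:
        parts.append(adjective)                  # e.g., Romantic, Dark, Suspenseful
    if noun:
        parts.append(noun)                       # e.g., Dramas, Comedies
    if based_on:
        parts.append(f"Based on {based_on}")     # source of the idea
    if set_in:
        parts.append(f"Set in {set_in}")         # place or era
    if from_decade:
        parts.append(f"From the {from_decade}")  # decade of production
    if about:
        parts.append(f"About {about}")           # thematic element
    return " ".join(parts)

genre = build_genre(adjective="Romantic", noun="Dramas",
                    based_on="Classic Literature", set_in="Europe",
                    from_decade="1940s", about="Marriage")
# → "Romantic Dramas Based on Classic Literature Set in Europe From the 1940s About Marriage"
```

The point of the sketch is that a small vocabulary plus a simple grammar yields a combinatorially large, yet consistently formatted, set of genres.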
What Movie Descriptions Teach Us About Metadata
The movie metadata created by Netflix is conceptually similar to the music metadata created by Pandora with its Music Genome Project. The detailed knowledge of movies that Netflix has encoded as supporting metadata makes movie night magic for their subscribers.
Todd Yellin of Netflix, who guided their movie metadata production, described the goal of personalized movie recommendations as “putting the right title in front of the right person at the right time.” That is the same goal lauded in the data management industry for decades: delivering the right data to the right person at the right time.
To deliver the right data you need to know as much as possible about the data you have. More important, you need to encode that knowledge as supporting metadata.
Variety is the characteristic of big data that holds the most potential for exploitation, Edd Dumbill explained in his Forbes article Big Data Variety means that Metadata Matters. “The notion of variety in data encompasses the idea of using multiple sources of data to help understand a problem. Even the smallest business has multiple data sources they can benefit from combining. Straightforward access to a broad variety of data is a key part of a platform for driving innovation and efficiency.”
But the ability to take advantage of variety, Dumbill explained, is hampered by the fact that most “data systems are geared up to expect clean, tabular data of the sort that flows into relational database systems and data warehouses. Handling diverse and messy data requires a lot of cleanup and preparation. Four years into the era of data scientists, most practitioners report that their primary occupation is still obtaining and cleaning data sets. This forms 80% of the work required before the much-publicized investigational skill of the data scientist can be put to use.”
This raises the question Mary Shacklett asked in her TechRepublic article Data quality: The ugly duckling of big data? “While it seems straightforward to just pull data from source systems,” Shacklett explained, “when all of this multifarious data is amalgamated into vast numbers of records needed for analytics, this is where the dirt really shows.” But somewhat paradoxically, “cleaning data can be hard to justify for ROI, because you have yet to see what clean data is going to deliver to your analytics and what the analytics will deliver to your business.”
However, Dumbill explained, “to focus on the problems of cleaning data is to ignore the primary problem. A chief obstacle for many business and research endeavors is simply locating, identifying, and understanding data sources in the first place, either internal or external to an organization.”
This is where metadata comes into play, providing a much-needed context for interpreting data and helping avoid semantic inconsistencies that can stymie our understanding of data. While good metadata has always been a necessity, big data needs even better metadata. “The documentation and description of datasets with metadata,” Dumbill explained, “enhances the discoverability and usability of data both for current and future applications, as well as forming a platform for the vital function of tracking data provenance.”
“The practices and tools of big data and data science do not stand alone in the data ecosystem,” Dumbill concluded. “The output of one step of data processing necessarily becomes the input of the next.” When approaching big data, the focus on analytics, as well as concerns about data quality, not only causes confusion about the order of those steps, but also overlooks the important role that metadata plays in the data ecosystem.
By enhancing the discoverability of data, metadata essentially replaces hide-and-seek with tag. As we prepare for a particular analysis, metadata enables us to locate and tag the data most likely to prove useful. After we tag which data we need, we can then cleanse that data to remove any intolerable defects before we begin our analysis. These three steps—tag, cleanse, analyze—form the basic framework of a big data strategy.
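The tag, cleanse, analyze sequence can be sketched as a minimal pipeline. The datasets, their metadata tags, and the cleansing rule below are all hypothetical, chosen only to show the order of the steps.

```python
# Hypothetical datasets carrying metadata tags; field names are assumptions.
datasets = [
    {"name": "web_orders", "tags": {"sales", "customer"},
     "rows": [{"amount": 120.0}, {"amount": None}, {"amount": 75.5}]},
    {"name": "call_logs", "tags": {"support"},
     "rows": [{"amount": None}]},
]

def tag_step(datasets, needed):
    """Tag: use metadata to locate the datasets relevant to the analysis."""
    return [d for d in datasets if needed & d["tags"]]

def cleanse_step(rows):
    """Cleanse: remove records with intolerable defects (here, missing amounts)."""
    return [r for r in rows if r["amount"] is not None]

def analyze_step(rows):
    """Analyze: a trivial aggregation standing in for the real analysis."""
    return sum(r["amount"] for r in rows)

relevant = tag_step(datasets, {"sales"})
clean = [r for d in relevant for r in cleanse_step(d["rows"])]
total = analyze_step(clean)
# → 195.5
```

Note that cleansing happens only on the data that metadata already told us we need, which is the cost-control argument made above.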
It all begins with metadata management. As Dumbill said, “it’s not glamorous, but it’s powerful.”
A few weeks ago, while reading about the winners at the 56th Annual Grammy Awards, I saw that Daft Punk won both Record of the Year and Album of the Year, which made me wonder what the difference is between a record and an album.
Then I read that Record of the Year is awarded to the performer and the production team of a single song. While Daft Punk won Record of the Year for their song “Get Lucky”, the song was not lucky enough to win Song of the Year (that award went to Lorde for her song “Royals”).
My confusion about the semantics of the Grammy Awards prompted a quick trip to Wikipedia, where I learned that Record of the Year is awarded for either a single or individual track from an album. This award goes to the performer and the production team for that one song. In this context, record means a particular recorded song, not its composition or an album of songs.
Although Song of the Year is also awarded for a single or individual track from an album, the recipient of this award is the songwriter who wrote the lyrics to the song. In this context, song means the song as composed, not its recording.
The Least Ambiguous Award goes to Album of the Year, which is indeed awarded for a whole album. This award goes to the performer and the production team for that album. In this context, album means a recorded collection of songs, not the individual songs or their compositions.
These distinctions, and the confusion they caused me, seemed eerily reminiscent of the challenges that happen within organizations when data is ambiguously defined. For example, terms like customer and revenue are frequently used without definition or context. When data definitions are ambiguous, it can easily lead to incorrect uses of data as well as confusing references to data during business discussions.
Not only is it difficult to reach consensus on data definitions, but definitions also change over time. For example, Record of the Year used to be awarded to only the performer, not the production team. And the definition of who exactly counts as a member of the production team has been changed four times over the years, most recently in 2013.
Avoiding semantic inconsistencies, such as the difference between a baker and a Baker, is an important aspect of metadata management. Be diligent with your data definitions and avoid daft definitions for sound semantics.
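One lightweight defense against daft definitions is a business glossary that keys each term by its context. Here is a minimal sketch using the Grammy distinctions above as entries; the data structure and lookup function are hypothetical illustrations, not any particular glossary tool.

```python
# A tiny business glossary keyed by (term, context) pairs.
glossary = {
    ("record", "Grammy Awards"): "a particular recorded song, credited to the performer and production team",
    ("song",   "Grammy Awards"): "the song as composed, credited to the songwriter",
    ("album",  "Grammy Awards"): "a recorded collection of songs, credited to the performer and production team",
}

def define(term, context):
    """Look up a term within its context; fail loudly rather than guess."""
    return glossary.get((term.lower(), context),
                        "undefined (ambiguous without context)")

define("Song", "Grammy Awards")
# → "the song as composed, credited to the songwriter"
```

The same term in a different context ("record" at a different awards show, or "customer" in a different business unit) deliberately returns the undefined marker instead of a misleading answer.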
In his recent blog post What’s the Matter with ‘Meta’?, John Owens lamented the misuse of the term metadata — the metadata about metadata — when discussing matters within the information management industry, such as metadata’s colorful connection with data quality. As Owens explained, “the simplest definition for ‘Meta Data’ is that it is ‘data about data’. To be more precise, metadata describes the structure and format (but not the content) of the data entities of an enterprise.” Owens provided several examples in his blog post, which also received great commentary.
Some commenters resisted oversimplifying metadata as data about data, including Rob Karel who, in his recent blog post Metadata, So Mom Can Understand, explained that “at its most basic level, metadata is something that helps to better describe the data you’re trying to remember.”
This metadata crisis reminded me of the book Moonwalking with Einstein: The Art and Science of Remembering Everything, where author Joshua Foer described a strange kind of forgetfulness that psychologists have dubbed the Baker-baker Paradox.
As Foer explained it: “A researcher shows two people the same photograph of a face and tells one of them the guy is a baker and the other that his last name is Baker. A couple days later, the researcher shows the same two guys the same photograph and asks for the accompanying word. The person who was told the man’s profession is much more likely to remember it than the person who was given his surname. Why should that be? Same photograph. Same word. Different amount of remembering.”
“When you hear that the man in the photo is a baker,” Foer explained, “that fact gets embedded in a whole network of ideas about what it means to be a baker: He cooks bread, he wears a big white hat, he smells good when he comes home from work.”
“The name Baker, on the other hand,” Foer continued, “is tethered only to a memory of the person’s face. That link is tenuous, and should it dissolve, the name will float off irretrievably into the netherworld of lost memories. (When a word feels like it’s stuck on the tip of the tongue, it’s likely because we’re accessing only part of the neural network that contains the idea, but not all of it.)”
“But when it comes to the man’s profession,” Foer concluded, “there are multiple strings to reel the memory back in. Even if you don’t at first remember that the man is a baker, perhaps you get some vague sense of breadiness about him, or see some association between his face and a big white hat, or maybe you conjure up a memory of your own neighborhood bakery. There are any number of knots in that tangle of associations that can be traced back to his profession.”
Metadata makes data better, helping us untangle the knots of associations among the data and information we use every day. Whether we be bakers or Bakers, or professions or people described by other metadata, the better we can describe ourselves and our business, the better our business will be.
Although we may not always agree on the definitions demarcating metadata, data, and information, let’s not forget that what matters most is enabling better business the best we can.
Information, data, and metadata are three interrelated words we hear a lot in the enterprise information management industry. An example of the difference, and relationship, between data and information is grapes and wine, where data is to grapes as information is to wine, meaning that information is created from data. And metadata is essential to understanding data, information, and the business and technical aspects of the processes that transform data into information.
In fact, the importance of metadata adding context all along the journey from data to information cannot be overstated. As David Weinberger explained in his book Too Big to Know, “the atoms of data hook together only because they share metadata.”
Although it has always played an essential role in information development, metadata management has an even bigger role to play in the era of big data and information overload.
“The solution to the information overload problem,” according to Weinberger, “is to create more information: metadata. When you put a label on a folder, you’re using metadata so that you can find the papers within it . . . just as a caption helps us make sense of a photo.”
Photos in need of captions and videos in need of categories are great examples of the rise of unstructured data, which is deepening our dependence on metadata. And the semi-structured data of social media (e.g., tweets with hashtags) is another example of how data without the context provided by metadata will never be able to complete its journey to information.
Of course, the journey doesn’t end with information. In 1988, Russell Ackoff, as Weinberger explained, “sketched a pyramid that has probably been redrawn on a white board somewhere in the world every hour since. The largest layer at the bottom of the pyramid represents data, followed by successively narrower layers of information, knowledge, understanding, and wisdom. The drawing makes perfect visual sense: There’s obviously plenty of data in the world, but not a lot of wisdom. Starting from mere ones and zeroes, up through what they stand for, what they mean, what sense they make, and what insight they provide, each layer acquires value from the ones below it.” Furthermore, I would argue that metadata provides the footholds allowing us to scale from one layer of the pyramid to the next.
Metadata is our guide on the journey from data to information, enabling us to understand the often complex business and technical contexts surrounding enterprise information management, and allowing us to journey further toward meaningful knowledge and actionable insight.
A lot of the fast-moving large volumes of various data swimming in the big data primordial soup are unstructured or semi-structured. Without metadata, the amino acids of data won’t combine into the protein chains of information, the building blocks of meaningful knowledge and actionable insight.
Data has always needed metadata, but as you make the business case for big data in your organization, you’d better remember the bigger your data, the better your metadata needs to be. In other words, bigger data needs better metadata.
I recently participated in a podcast to talk about why I’m involved in MIKE2.0 and how information can be turned to a company’s advantage. In summary, too many organizations are only looking at information from a defensive perspective with a focus on compliance.
Compliance in general, and for many organizations, Sarbanes-Oxley in particular, are topics that get a lot of management attention. The core of the work is to define business processes and to identify control points. When I look at the results from most companies, I see vast quantities of process documentation, often in Microsoft Visio, which has been printed into fat binders and placed on the shelf. Compliance achieved!
You don’t need me to tell you about the benefits of living documents. Any analysis which sits on the shelf is out-of-date before it is even printed. There have been many discussions about engineering systems on the back of the process documentation; however, few approaches have been truly successful, as inevitably there is a separation of some kind between the applications that run the process and the documentation.
My colleagues and I, along with most people involved in MIKE2.0, advocate a different approach: start by looking at the way you measure compliance, which means looking at the data that comes out of each control point. If the data is complete, then the navigation between control points is actually of much less consequence (different people do their jobs differently).
When we take this data-driven approach, we find that a complete analysis of control points also generally identifies the most valuable information held by the company. It should come as no surprise that controls provide a live feed of business-crucial activities – Business Activity Monitoring (BAM)! Now we can support multiple applications providing the same data, but doing it in different ways (often this corresponds to product systems), and we can free up business units to find creative ways to achieve the best possible business outcome.
The key to doing this successfully is to take an Information Development approach. If governance and business supervision focus on outcomes (measured through the control-point data) rather than process steps, then the company is generally more agile, able to integrate new business units more rapidly, and staffed by empowered executives.
I recently attended IBM’s Information On Demand conference in Las Vegas, including meeting with IBM’s Information Management CTO, Anant Jhingran. Anant and IBM understand the necessity of separating content from the application; I suspect this is why they are happy to stay out of the application space and why they are so supportive of SOA, specialist XML vendors, and other forms of open communities.
Two of these XML vendors that I find particularly interesting in this context, because of their support of this “ecosystem” style of approach, are JustSystems and CoreFiling.
JustSystems, perhaps best known in the past as a Japanese “office” software company, have made a major push in the XML space with products like xfy, which allows organizations to build process flows and dynamic datasets without having to build the full system. We find this attractive because it supports the Information Development approach: prototyping focused on the content, then building a process, providing a content test platform, and then (in production) providing a place to review and manage content irrespective of the application that manages the process flow.
CoreFiling have been one of the early XBRL providers. XBRL is the emerging business reporting XML standard and is gaining rapid acceptance (particularly with regulators, hence its attraction to organizations with significant compliance obligations). CoreFiling provide products such as SpiderMonkey, which supports the dynamic development of metadata (or taxonomies) across multiple applications and user groups, which is critical if the Information Development philosophy is to scale beyond small workgroups.
There is huge interest from clients in enterprise search, with the focus being how to create useful applications that go beyond documents or web pages. Increasingly, we’re seeing organizations that have invested in metadata for regulatory compliance discovering the value of this asset using search technologies and techniques.
The original web experience was intended to be click-based, navigating via a number of hubs to any point on the internet, but the last five years have seen the majority of users move to a language-based approach, starting with a site like Google or Yahoo. The example I often use is the rain radar: when setting out to a meeting in a city, I’ll often check to see if rain is coming. In Melbourne I can navigate from the www.bom.gov.au website to the radar, but it’s faster to type “Melbourne weather radar” into Google, with the added benefit that I can use the same interface when I’m in Auckland, Singapore, New York, etc.
At work, users are still stuck in the late ’90s, relying on incomplete intranets and a poorly maintained web of links. The problem is primarily access to the structured repositories and, even more importantly, access to the structures of those repositories (i.e., the metadata).
In many cases, banks have been the early adopters of metadata repositories followed by insurers and then the very large government departments. The main driver for these repositories has been compliance and (for banks) risk (Basel II). These repositories are enormously rich in content, but extremely difficult to interface to the rest of the organization’s information. Search can be the solution and I recommend the following three steps:
1. Interface to metadata repositories
In a bank, a user should be able to search for “Risk Weighted Asset” and find not only the relevant documents but also a list of the systems and databases that contain relevant data as well as appropriate controls, processes and business rules. It isn’t difficult to build interfaces between structured metadata and the search tools.
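A minimal sketch of such an interface: one query fans out to both a document index and a structured metadata repository, so a single search returns documents alongside the systems and controls that hold related data. All of the records, field names, and the search function below are hypothetical illustrations.

```python
# Hypothetical document index and metadata repository; contents are assumptions.
document_index = [
    {"title": "Basel II Capital Policy",
     "text": "how the risk weighted asset calculation feeds capital ratios"},
]
metadata_repository = [
    {"term": "Risk Weighted Asset", "system": "CreditRiskDB",
     "controls": ["RWA month-end reconciliation"]},
]

def search(query):
    """Return matching documents AND matching structured-metadata entries."""
    q = query.lower()
    docs = [d["title"] for d in document_index if q in d["text"].lower()]
    assets = [m for m in metadata_repository if q in m["term"].lower()]
    return {"documents": docs, "data_assets": assets}

results = search("Risk Weighted Asset")
# results["documents"]   → ["Basel II Capital Policy"]
# results["data_assets"] → the CreditRiskDB entry, with its controls
```

A real implementation would sit behind the enterprise search tool’s connector framework, but the shape of the result, documents plus data assets in one response, is the point.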
2. Interface to master data
The next step is to build an interface that allows the user to type “Assets Walmart 2005” and find, via the metadata, appropriate queries that can then be launched in a BI tool (e.g., Business Objects or Cognos). This is part of my view that search should be the kick-off point for all information analysis. Again, this sounds difficult but really isn’t: you can use the metadata repository to define the dimensions of search and emulate hints (e.g., “Did you mean xyz?”) to help if the user is almost on target.
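This step can be sketched as mapping free-text tokens onto query dimensions drawn from the metadata repository. The dimension sets and the SQL template below are assumptions for illustration, not any particular BI tool’s API.

```python
# Dimensions that would, in practice, come from the metadata repository.
dimensions = {
    "measures": {"assets", "revenue", "liabilities"},
    "entities": {"walmart", "target"},
}

def to_query(text):
    """Map search tokens onto metadata-defined dimensions; None means
    we should fall back to 'Did you mean ...?' hints instead."""
    tokens = text.lower().split()
    measure = next((t for t in tokens if t in dimensions["measures"]), None)
    entity = next((t for t in tokens if t in dimensions["entities"]), None)
    year = next((t for t in tokens if t.isdigit() and len(t) == 4), None)
    if not (measure and entity and year):
        return None
    return (f"SELECT {measure} FROM financials "
            f"WHERE entity = '{entity}' AND year = {year}")

to_query("Assets Walmart 2005")
# → "SELECT assets FROM financials WHERE entity = 'walmart' AND year = 2005"
```

Because the dimensions come from the metadata repository rather than being hard-coded, adding a new measure or entity to the repository automatically extends what the search box understands.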
3. Better analysis of the quality of search
The search index increasingly becomes an asset in its own right. Using the techniques in MIKE2.0, we can do constant health checks on the usability and relevance of the search index itself.
One of the difficult aspects of Information Development is that organizations cannot “start over” – they need to fix the issues of the past. This means that the transition to a new model must incorporate a significant transition from the old world.
Most organizations have a very poorly defined view of their current state information architecture: models are undefined, quality is unknown and ownership is unclear. A product that models the Enterprise Information Architecture and provides a path for transitioning to the future state would therefore be extremely valuable.
Its capabilities could be grouped into two categories:
Things that can be done today, but typically through multiple products
- Can be used to define data models and interface schemas
- Provides GUI-based transformation mapping such as XML schema mapping or O-R mapping
- Is able to profile data and content to identify accuracy, consistency or integrity issues in a once-off or ongoing fashion
- Takes the direct outputs of profiling and incorporates these into a set of transformation rules
- Helps identify data-dependent business rules and classifies rule metadata
- Has an import utility to bring in common standards
New capabilities typically not seen in products today
- An ability to assign value to information based on its economic value within an organization
- Provides an information requirements gathering capability that includes drill-down and traceability mapping across requirements
- Provides a data mastering model that shows overlaps of information assets across the enterprise and rules for its propagation
- Provides an ownership model to assign individual responsibility for different areas of the information architecture (e.g. data stewards, data owners, CIO)
- Has a compliance feature that can be run to check adherence to regulations and recommended best practices
- Provides a collaborative capability for users to jointly work together for better Information Governance
In summary, this product would be like an advanced profiling tool, enterprise architecture modelling tool and planning, budgeting and forecasting tool in one. It would be a major advantage to organizations on their path to Information Development.
Today’s solutions for Active Metadata Integration and Model Driven Development seem to provide the starting point for this next generation product. Smaller software firms such as MetaMatrix provided some visionary ideas to begin to move organizations to model driven Information Development. The bi-directional metadata repositories provided by the major players such as IBM and Informatica are a big step in the right direction. There is, however, a significant opportunity for a product that can fill the gap that exists today.
We’re often asked to compare approaches to managing structured and unstructured data, and attempts to bridge the gap between the two. Traditionally, technology practitioners who worried about unstructured data have been an entirely different group from those who worried about structured data.
In fact, there are three types of data: structured, unstructured, and a hybrid (records-oriented) grouping of semi-structured. They have much in common and are all part of the enterprise information landscape. In order to look at ways to leverage the relative strengths of the different types of data, it is important to first understand how they are used.
There are three primary applications of data within most enterprises.
The first is in support of operational processes. In the case of structured data, these processes are usually complex from a system perspective but often quite transactional from a human perspective. In the case of semi-structured and unstructured data, there is often less system intervention or interpretation of the data with a heavy reliance on human interpretation.
Secondly, each of the three is used for analysis. In the case of structured, it is easy to understand how the analysis is undertaken. With semi-structured/record data, analysis can be divided into aggregation of the structured components and a manual analysis of the free-text. With unstructured, analysis is usually restricted to searching for like terms and manually evaluating the documents.
Finally, all three types of data are used as a reference to back up decisions and provide an audit trail for operational processes.
MIKE2.0 recommends approaches to governance, architecture and integration which are independent of the structure of the data itself.
The majority of effort associated with all data, regardless of its form, goes into gaining access to it at the time when it’s needed. In all three cases, there are processes to look up or search the data: SQL for structured data, lookups for semi-structured, and tree-oriented folders for unstructured. Increasingly, the techniques for finding all three types are converging into one set of processes called Enterprise Search.
Ironically, despite the power of search, successful implementations really mandate the implementation of common metadata and the use of a single enterprise metadata model. Again, MIKE2.0 takes the information architect through these requirements in a lot of detail.
In the future, organisations can expect to keep all three forms of data (structured, semi-structured records and unstructured documents) in the same repositories. However, there is no need to wait for this future utopia to begin leveraging all three in the same applications and managing them in a common way.