Archive for the ‘Metadata’ Category
Anthropologist Robin Dunbar has used his research in primates over recent decades to argue that there is a cognitive limit to the number of social relationships that an individual can maintain and hence a natural limit to the breadth of their social group. In humans, he has proposed that this number is 150, the so-called “Dunbar’s number”.
In the modern organisation, relationships are maintained using data. It doesn’t matter whether it is the relationship between staff and their customers, tracking vendor contracts, the allocation of products to sales teams or any other of the literally thousands of relationships that exist, they are all recorded centrally and tracked through the data that they throw off.
Social structures have evolved over thousands of years using data to deal with the inability of groups of more than 150 to effectively align. One of the best examples of this is the 11th century Doomsday Book ordered by William the Conqueror. Fast forward to the 21st century and technology has allowed the alignment of businesses and even whole societies in ways that were unimaginable 50 years ago.
Just as a leadership team needs to have a group of people that they relate to that falls within the 150 of Dunbar’s number, they also need to rely on information which allows the management system to extend that span of control. For the average executive, and ultimately for the average executive leadership team, this means that they can really only keep a handle on 150 “aspects” of their business, reflected in 150 “key data elements”. These elements anchor data sets that define the organisation.
Key Data Elements
To overcome the constraints of Dunbar’s number, mid-twentieth century conglomerates relied on a hierarchy with delegated management decisions whereas most companies today have heavily centralised decision making which (mostly) delivers a substantial gain in productivity and more efficient allocation of capital. They can only do this because of the ability to share information efficiently through the introduction of information technology across all layers of the enterprise.
This sharing, though, is dependent on the ability of an executive to remember what data is important. The same constraint of the human brain to know more than 150 people also applies to the use of that information. It is reasonable to argue that the information flows have the same constraint as social relationships.
Observing hundreds of organisations over many years, the variety of key data elements is wide but their number is consistently in the range of one to a few hundred. Perhaps topping out at 500, the majority of well-run organisations have nearer to 150 elements dimensioning their most important data sets.
While decisions are made through metrics, it is the most important key data elements that make up the measures and allow them to be dimensioned.
Although organisations have literally hundreds of thousands of different data elements they record, only a very small number are central to the running of the enterprise. Arguably, the centre can only keep track of about 150 and use them as a core of managing the business.
Another way of looking at this is that the leadership team (or even the CEO) can really only have 150 close relationships. If each relationship has one assigned data set or key data element they are responsible for then the overall organisation will have 150.
Choosing the right 150
While most organisations have around 150 key data elements that anchor their most important information, few actually know what they are. That’s a pity because the choice of 150 tells you a lot about the organisation. If the 150 don’t encompass the breadth of the enterprise then you can gain insight into what’s really important to the management team. If there is little to differentiate the key data elements from those that a competitor might choose then the company may lack a clear point of difference and be overly dependent on operational excellence or cost to gain an advantage.
Any information management initiative should start by identifying the 150 most important elements. If they can’t narrow the set down below a few hundred, they should be suspicious they haven’t gotten to the core of what’s really important to their sponsors. They should then look to ask the question of whether these key data elements span the enterprise or pick organisational favourites; whether they offer differentiation or are “me too” and whether they are easy or hard for a competitor to emulate.
The identification of the 150 key data elements provides a powerful foundation for any information and business strategy. Enabling a discussion on how the organisation is led and managed. While processes evolve quickly, the information flows persist. Understanding the 150 allows a strategist to determine whether the business is living up to its strategy or if its strategy needs to be adjusted to reflect the business’s strengths.
In a previous post on this blog, I reviewed what movie ratings teach us about data quality. In this post I ponder another movie-related metaphor for information development by looking at what movie descriptions teach us about metadata.
Nightmare on Movie Night
It’s movie night. What are you in the mood for? Action Adventure? Romantic Comedy? Science Fiction? Even after you settle on a genre, picking a movie within it can feel like a scene from a Suspense Thriller. Even if you are in the mood for a Horror film, you don’t want to turn movie night into a nightmare by watching a horrible movie. You need reliable information about movies to help you make your decision. You need better movie metadata.
Tag the Movie: A Netflix Original
In his article about How Netflix Reverse Engineered Hollywood, Alexis Madrigal explained how Netflix uses large teams of people specially trained to watch movies and tag them with all kinds of metadata.
Madrigal described the process as “so sophisticated and precise that taggers receive a 36-page training document that teaches them how to rate movies on their sexually suggestive content, goriness, romance levels, and even narrative elements like plot conclusiveness. They capture dozens of different movie attributes. They even rate the moral status of characters. When these tags are combined with millions of users viewing habits, they become Netflix’s competitive advantage. The company’s main goal as a business is to gain and retain subscribers. And the genres that it displays to people are a key part of that strategy.”
The Vocabulary and Grammar of Movies
As Madrigal investigated how Netflix describes movies, he discovered a well-defined vocabulary. Standardized adjectives were consistently used (e.g., Romantic, Critically Acclaimed, Dark, Suspenseful). Phrases beginning with “Based on” revealed where the idea for a movie came from (e.g., Real Life, Children’s Book, Classic Literature). Phrases beginning with “Set in” oriented where or when a movie was set (e.g., Europe, Victorian Era, Middle East, Biblical Times). Phrases beginning with “From the” dated the decade the movie was made in. Phrases beginning with “For” recommended appropriate age ranges for children’s movies. And phrases beginning with “About” provided insight about the thematic elements of a movie (e.g., Food, Friendship, Marriage, Parenthood).
Madrigal also discovered the rules of grammar Netflix uses to piece together the components of its movie vocabulary to generate more comprehensive genres in a consistent format, such as Romantic Dramas Based on Classic Literature Set in Europe From the 1940s About Marriage (one example of which is Pride and Prejudice).
What Movie Descriptions teach us about Metadata
The movie metadata created by Netflix is conceptually similar to the music metadata created by Pandora with its Music Genome Project. The detailed knowledge of movies that Netflix has encoded as supporting metadata makes movie night magic for their subscribers.
Todd Yellin of Netflix, who guided their movie metadata production, described the goal of personalized movie recommendations as “putting the right title in front of the right person at the right time.” That is the same goal lauded in the data management industry for decades: delivering the right data to the right person at the right time.
To deliver the right data you need to know as much as possible about the data you have. More important, you need to encode that knowledge as supporting metadata.
Data profiling is an excellent diagnostic method for gaining additional understanding of the data. Profiling the source data helps inform both business requirements definition and detailed solution designs for data-related project, as well as enabling data issues to be managed ahead of project implementation.
Profiling of a data set will be measured with reference to and agreed Data Quality Dimensions (e.g. per those proposed in the recent DAMA white paper).
Profiling may be required at several levels:
• Simple profiling with a single table (e.g. Primary Key constraint violations)
• Medium complexity profiling across two or more interdependent tables (e.g. Foreign Key violations)
• Complex profiling across two or more data sets, with applied business logic (e.g. reconciliation checks)
Note that field-by-field analysis is required to truly understand the data gaps.
Any data profiling analysis must not only identify the issues and underlying root causes, but must also identify the business impact of the data quality problem (measured by effectiveness, efficiency, risk inhibitors). This will help identify any value in remediating the data – great for your data quality Business Case. Root cause analysis also helps identify any process outliers and and drives out requirements for remedial action on managing any identified exceptions.
Be sure to profile your data and take baseline measures before applying any remedial actions – this will enable you to measure the impact of any changes.
I strongly recommend Data Quality Profiling and root-cause analysis to be undertaken as an initiation activity as part of all data warehouse, master data and application migration project phases.
Over the years, I’ve tended to find that asking any individual or group the question “What data/information do you want?” gets one of two responses:
“I don’t know.” Or;
“I don’t know what you mean by that.”
End of discussion, meeting over, pack up go home, nobody is any the wiser. Result? IT makes up the requirements based on what they think the business should want, the business gets all huffy because IT doesn’t understand what they need, and general disappointment and resentment ensues.
Clearly for Information Management & Business Intelligence solutions, this is not a good thing.
So I’ve stopped asking the question. Instead, when doing requirements gathering for an information project, I go through a workshop process that follows the following outline agenda:
Context setting: Why information management / Business Intelligence / Analytics / Data Governance* is generally perceived to be a “good thing”. This is essentially a very quick précis of the BI project mandate, and should aim at putting people at ease by answering the question “What exactly are we all doing here?”
(*Delete as appropriate).
Business Function & Process discovery: What do people do in their jobs – functions & tasks? If you can get them to explain why they do those things – i.e. to what end purpose or outcome – so much the better (though this can be a stretch for many.)
Challenges: what problems or issues do they currently face in their endeavours? What prevents them from succeeding in their jobs? What would they do differently if they had the opportunity to do so?
Opportunities: What is currently good? Existing capabilities (systems, processes, resources) are in place that could be developed further or re-used/re-purposed to help achieve the desired outcomes?
Desired Actions: What should happen next?
As a consultant, I see it as part of my role to inject ideas into the workshop dialogue too, using a couple of question forms specifically designed to provoke a response:
“What would happen if…X”
“Have you thought about…Y”
“Why do you do/want…Z”.
Notice that as the workshop discussion proceeds, the participants will naturally start to explore aspects that relate to later parts of the agenda – this is entirely ok. The agenda is there to provide a framework for the discussion, not a constraint. We want people to open up and spill their guts, not clam up. (Although beware of the “rambler” who just won’t shut up but never gets to the point…)
Notice also that not once have we actively explored the “D” or “I” words. That’s because as you explore the agenda, any information requirements will either naturally fall out of the discussion as it proceed, or else you can infer the information requirements arising based on the other aspects of the discussion.
As the workshop attendees explore the different aspects of the session, you will find that the discussion will touch upon a number of different themes, which you can categorise and capture on-the-fly (I tend to do this on sheets of butchers paper tacked to the walls, so that the findings are shared and visible to all participants.). Comments will typically fall into the following broad categories:
* Functions: Things that people do as part of doing business.
* Stakeholders: people who are involved (including helpful people elsewhere in the organisation – follow up with them!)
* Inhibitors: Things that currently prevent progress (these either become immediate scope-change items if they are show-stoppers for the current initiative, or else they form additional future project opportunities to raise with management)
* Enablers: Resources to make use of (e.g. data sets that another team hold, which aren’t currently shared)
* Constraints: “non-negotiable” aspects that must be taken into account. (Note: I tend to find that all constraints are actually negotiable and can be overcome if there is enough desire, money and political will.)
* Considerations: Things to be aware of that may have an influence somewhere along the line.
* Source systems: places where data comes from
* Information requirements: Outputs that people want
Here’s a (semi) fictitious example:
e.g. ADD: “What does your team do?”
Workshop Victim Participant #1: “Well, we’re trying to reconcile the customer account balances with the individual transactions.”
ADD: And why do you wan to do that?
Workshop Victim Participant #2: “We think there’s a discrepancy in the warehouse stock balances, compared with what’s been shipped to customers. The sales guys keep their own database of customer contracts and orders and Jim’s already given us dump of the data, while finance run the accounts receivables process. But Sally the Accounts Clerk doesn’t let the numbers out under any circumstances, so basically we’re screwed.”
Functions: Sales Processing, Contract Mangement, Order Fulfilment, Stock Management, Accounts Receivable.
Stakeholders: Warehouse team, Sales team (Jim), Finance team.
Inhibitors: Finance don’t collaborate.
Enablers: Jim is helpful.
Source Systems: Stock System, Customer Database, Order Management, Finance System.
Information Requirements: Orders (Quantity & Price by Customer, by Salesman, by Stock Item), Dispatches (Quantity & Price by Customer, by Salesman, by Warehouse Clerk, by Stock Item), Financial Transactions (Value by Customer, by Order Ref)
You will also probably end up with the attendees identifying a number of immediate self-assigned actions arising from the discussion – good ideas that either haven’t occurred to them before or have sat on the “To-Do” list. That’s your workshop “value add” right there….
Workshop Victim Participant #1: “I could go and speak to the Financial Controller about getting access to the finance data. He’s more amenable to working together than Sally, who just does what she’s told.”
Happy information requirements gathering!
Why estimating Data Quality profiling doesn’t have to be guess-work
Data Management lore would have us believe that estimating the amount of work involved in Data Quality analysis is a bit of a “Dark Art,” and to get a close enough approximation for quoting purposes requires much scrying, haruspicy and wet-finger-waving, as well as plenty of general wailing and gnashing of teeth. (Those of you with a background in Project Management could probably argue that any type of work estimation is just as problematic, and that in any event work will expand to more than fill the time available…).
However, you may no longer need to call on the services of Severus Snape or Mystic Meg to get a workable estimate for data quality profiling. My colleague from QFire Software, Neil Currie, recently put me onto a post by David Loshin on SearchDataManagement.com, which proposes a more structured and rational approach to estimating data quality work effort.
At first glance, the overall methodology that David proposes is reasonable in terms of estimating effort for a pure profiling exercise – at least in principle. (It’s analogous to similar “bottom/up” calculations that I’ve used in the past to estimate ETL development on a job-by-job basis, or creation of standards Business Intelligence reports on a report-by-report basis).
I would observe that David’s approach is predicated on the (big and probably optimistic) assumption that we’re only doing the profiling step. The follow-on stages of analysis, remediation and prevention are excluded – and in my experience, that’s where the real work most often lies! There is also the assumption that a pre-existing checklist of assessment criteria exists – and developing the library of quality check criteria can be a significant exercise in its own right.
However, even accepting the “profiling only” principle, I’d also offer a couple of additional enhancements to the overall approach.
Firstly, even with profiling tools, the inspection and analysis process for any “wrong” elements can go a lot further than just a 10-minute-per-item-compare-with-the-checklist, particularly in data sets with a large number of records. Also, there’s the question of root-cause diagnosis (And good DQ methods WILL go into inspecting the actual member records themselves). So for contra-indicated attributes, I’d suggest a slightly extended estimation model:
* 10mins: for each “Simple” item (standard format, no applied business rules, fewer that 100 member records)
* 30 mins: for each “Medium” complexity item (unusual formats, some embedded business logic, data sets up to 1000 member records)
* 60 mins: for any “Hard” high-complexity items (significant, complex business logic, data sets over 1000 member records)
Secondly, and more importantly – David doesn’t really allow for the human factor. It’s always people that are bloody hard work! While it’s all very well to do a profiling exercise in-and-of-itself, the result need to be shared with human beings – presented, scrutinised, questioned, validated, evaluated, verified, justified. (Then acted upon, hopefully!) And even allowing for the set-aside of the “Analysis” stages onwards, then there will need to be some form of socialisation within the “Profiling” phase.
That’s not a technical exercise – it’s about communication, collaboration and co-operation. Which means it may take an awful lot longer than just doing the tool-based profiling process!
How much socialisation? That depends on the number of stakeholders, and their nature. As a rule-of-thumb, I’d suggest the following:
* Two hours of preparation per workshop ((If the stakeholder group is “tame”. Double it if there are participants who are negatively inclined).
* One hour face-time per workshop (Double it for “negatives”)
* One hour post-workshop write-up time per workshop
* One workshop per 10 stakeholders.
* Two days to prepare any final papers and recommendations, and present to the Steering Group/Project Board.
That’s in addition to David’s formula for estimating the pure data profiling tasks.
Detailed root-cause analysis (Validate), remediation (Protect) and ongoing evaluation (Monitor) stages are a whole other ball-game.
Alternatively, just stick with the crystal balls and goats – you might not even need to kill the goat anymore…
A “foreign” colleague of mine once told me a trick his English language teacher taught him to help him remember the “questioning words” in English. (To the British, anyone who is a non-native speaker of English is “foreign.” I should also add that as a Scotsman, English is effectively my second language…).
“Five Whiskies in a Hotel” is the clue – i.e. five questioning words begin with “W” (Who, What, When, Why, Where), with one beginning with “H” (How).
These simple question words give us a great entry point when we are trying to capture the initial set of issues and concerns around data governance – what questions are important/need to be asked.
* What data/information do you want? (What inputs? What outputs? What tests/measures/criteria will be applied to confirm whether the data is fit for purpose or not?)
* Why do you want it? (What outcomes do you hope to achieve? Does the data being requested actually support those questions & outcome? Consider Efficiency/Effectiveness/Risk Mitigation drivers for benefit.)
* When is the information required? (When is it first required? How frequently? Particular events?)
* Who is involved? (Who is the information for? Who has rights to see the data? Who is it being provided by? Who is ultimately accountable for the data – both contents and definitions? Consider multiple stakeholder groups in both recipients and providers)
* Where is the data to reside? (Where is it originating form? Where is it going to?)
* How will it be shared? (How will the mechanisms/methods work to collect/collate/integrate/store/disseminate/access/archive the data? How should it be structured & formatted? Consider Systems, Processes and Human methods.)
Clearly, each question can generate multiple answers!
Aside: in the Doric dialect of North-East of Scotland where I originally hail from, all the “question” words begin with “F”:
Fit…? (What?) e.g. “Fit dis yon feel loon wint?” (What does that silly chap want?)
Fit wye…? (Why?) e.g. “Fit wye div ye wint a’thin’?” (Why do you want everything?)
Fan…? (When?) e.g. “Fan div ye wint it?” (When you you want it?)
Fa…? (Who?) e.g. “Fa div I gie ‘is tae?” (Who do I give this to?)
Far…? (Where?) e.g. “Far aboots dis yon thingumyjig ging?” (Where exactly does that item go?)
Foo…? (How?) e.g. “Foo div ye expect me tae dae it by ‘e morn?” (How do you expect me to do it by tomorrow?)
Whatever your native language, these key questions should get the conversation started…
Remember too, the homily by Rudyard Kipling:
Is this the kind of response you get when you mention to people that you work in Data Quality?!
Let’s be honest here. Data Quality is good and worthy, but it can be a pretty dull affair at times. Information Management is something that “just happens”, and folks would rather not know the ins-and-outs of how the monthly Management Pack gets created.
Yet I’ll bet that they’ll be right on your case when the numbers are “wrong”.
So here’s an idea. The next time you want to engage someone in a discussion about data quality, don’t start by discussing data quality. Don’t mention the processes of profiling, validating or cleansing data. Don’t talk about integration, storage or reporting. And don’t even think about metadata, lineage or auditability. Yaaaaaaaaawn!!!!
Instead of concentrating on telling people about the practitioner processes (which of course are vital, and fascinating no doubt if you happen to be a practitioner), think about engaging in a manner that is relevant to the business community, using language and examples that are business-oriented. Make it fun!
Once you’ve got the discussion flowing in terms of the impacts, challenges and inhibitors that get in the way of successful business operations, then you can start to drill into the underlying data issues and their root causes. More often than not, a data quality issue is symptomatic of a business process failure rather than being an end in itself. By fixing the process problem, the business user gains a benefit, and the data in enhanced as a by-product. Everyone wins (and you didn’t even have to mention the dreaded DQ phrase!)
Data Quality is a human thing – that’s why its hard. As practitioners, we need to be communicators. Lead the thinking, identify the impact and deliver the value.
Now, that’s interesting!
Just recently, Gary Allemann posted a guest article on Nicola Askham’s Blog, which made an analogy between Data Governance and the London Tube map. (Nicola also on Twitter. See also Gary Allemann’s blog, Data Quality Matters.)
Up until now, I’ve always struggled to think of a way to represent all of the different aspects of Information Management/Data Governance; the environment is multi-faceted, with the interconnections between the component capabilities being complex and not hierarchical. I’ve sometimes alluded to there being a network of relationship between elements, but this has been a fairly abstract concept that I’ve never been able to adequately illustrate.
And in a moment of perspiration, I came up with this…
I’ll be developing this further as I go but in the meantime, please let me know what you think.
(NOTE: following on from Seth Godin’ plea for more sharing of ideas, I am publishing the Information Management Tube Map under Creative Commons License Attribution Share-Alike V4.0 International. Please credit me where you use the concept, and I would appreciate it if you could reference back to me with any changes, suggestions or feedback. Thanks in advance.)
When I was a kid growing up in the UK, Paul Daniels was THE television magician. With a combination of slick high drama illusions, close-up trickery and cheeky end-of-the-pier humour, (plus a touch of glamour courtesy of The Lovely Debbie McGee TM), Paul had millions of viewers captivated on a weekly basis and his cheeky catch-phrases are still recognised to this day.
Of course. part of the fascination of watching a magician perform is to wonder how the trick works. “How the bloody hell did he do that?” my dad would splutter as Paul Daniels performed yet another goofy gag or hair-raising stunt (no mean fear, when you’re as bald as a coot…) But most people don’t REALLY want to know the inner secrets, and ever fewer of us are inspired to spray a riffle-shuffled a pack of cards all over granny’s lunch, stick a coin up their nose or grab the family goldfish from its bowl and hide it in the folds of our nether-garments. (Um, yeah. Let’s not go there…)
Penn and Teller are great of course, because they expose the basic techniques of really old, hackneyed tricks and force more innovation within the magician community. They’re at their most engaging when they actually do something that you don’t get to see the workings of. Illusion maintained, audience entertained.
As data practitioners, I think we can learn a few of these tricks. I often see us getting too hot-and-bothered about differentiating data, master data, reference data, metadata, classification scheme, taxonomy, dimensional vs relational vs data vault modelling etc. These concepts are certainly relevant to our practitioner world, but I don’t necessarily believe they need to be exposed at the business-user level.
For example, I often hear business users talking about “creating the metadata” for an event or transaction, when they’re talking about compiling the picklist of valid descriptive values and mapping these to the contextualising descriptive information for that event (which by my reckoning, really means compiling the reference data!). But I’ve found that business people really aren’t all that bothered about the underlying structure or rigour of the modelling process.
That’s our job.
There will always be exceptions. My good friend and colleague Ben Bor is something a special case and has the talent to combine data management and magic.
But for the rest of us mere mortals, I suggest that we keep the deep discussion of data techniques for the Data Magic Circle, and just let the paying customers enjoy the show….
Being a data management practitioner can be tough.
You’re expected to work your data quality magic, solve other people’s data problems, and help people get better business outcomes. It’s a valuable, worthy and satisfying profession. But people can be infuriating and frustrating, especially when the business user isn’t taking responsibility for the quality of their own data.
It’s a bit like being a Medical Doctor in general practice.
The patent presents with some early indicative symptoms. The MD then performs a full diagnosis and recommends a course of treatment. It’s then up to the patient whether or not they take their MD’s advice…
AlanDDuncan: “Doctor, Doctor. I get very short of breath when I go upstairs.”
MD: Yes, well. Your Body Mass Index is over 30, you’ve got consistently high blood pressure, your heatbeat is arrhythmic, and cholesterol levels are off the scale.”
ADD: “So what does that mean, doctor?”
MD: “It means you’re fat, you drink like a fish, you smoke like a chimney, your diet consists of fried food and cakes and you don’t do any exercise.”
ADD: “I’m Scottish.”
MD: “You need to change your lifestyle completely, or you’re going to die.”
ADD: “Oh. So, can you give me some pills?….”
If you’re going to get healthy with your data, you’ll going to have to put the pies down, step away from the Martinis and get off the couch folks.
TODAY: Tue, March 28, 2017March2017