07 Jun 2014
Missed what’s been happening in the MIKE2.0 community? Check out our bi-weekly update:
28 May 2014
In his book Open Data Now: The Secret to Hot Startups, Smart Investing, Savvy Marketing, and Fast Innovation, Joel Gurin explained that Open Data and Big Data are related but very different.
While various definitions exist, Gurin noted that “all definitions of Open Data include two basic features: the data must be publicly available for anyone to use, and it must be licensed in a way that allows for its reuse. Open Data should also be in a form that makes it relatively easy to use and analyze, although there are gradations of openness. And there’s general agreement that Open Data should be free of charge or cost just a minimal amount.”
“Big Data involves processing very large datasets to identify patterns and connections in the data,” Gurin explained. “It’s made possible by the incredible amount of data that is generated, accumulated, and analyzed every day with the help of ever-increasing computer power and ever-cheaper data storage. It uses the data exhaust that all of us leave behind through our daily lives. Our mobile phones’ GPS systems report back on our location as we drive; credit card purchase records show what we buy and where; Google searches are tracked; smart meters in our homes record our energy usage. All are grist for the Big Data mill.”
Private and Passive versus Public and Purposeful
Gurin explained that Big Data tends to be private and passive, whereas Open Data tends to be public and purposeful.
“Big Data usually comes from sources that passively generate data without purpose, without direction, or without even realizing that they’re creating it. And the companies and organizations that use Big Data usually keep the data private for business or security reasons. This includes the data that large retailers hold on customers’ buying habits, that hospitals hold about their patients, and that banks hold about their credit card holders.”
By contrast, Open Data “is consciously released in a way that anyone can access, analyze, and use as he or she sees fit. Open Data is also often released with a specific purpose in mind—whether the goal is to spur research and development, fuel new businesses, improve public health and safety, or achieve any number of other objectives.”
“While Big Data and Open Data each have important commercial uses, they are very different in philosophy, goals, and practice. For example, large companies may use Big Data to analyze customer databases and target their marketing to individual customers, while they use Open Data for market intelligence and brand building.”
Big and Open Data
Gurin also noted, however, that some of the most powerful results arise when Big Data and Open Data overlap.
“Some government agencies have made very large amounts of data open with major economic benefits. National weather data and GPS data are the most often-cited examples. U.S. census data and data collected by the Securities and Exchange Commission and the Department of Health and Human Services are others. And nongovernmental research has produced large amounts of data, particularly in biomedicine, that is now being shared openly to accelerate the pace of scientific discovery.”
Data Open for Business
Gurin addressed the apparent paradox of Open Data: “If Open Data is free, how can anyone build a business on it? The answer is that Open Data is the starting point, not the endpoint, in deriving value from information.” For example, even though weather and GPS data have been available for decades, those same Open Data starting points continue to spark new ideas, generating new, and profitable, endpoints.
While data privacy still requires sensitive data not be shared without consent and competitive differentiation still requires an organization’s intellectual property not be shared, that still leaves a vast amount of other data which, if made available as Open Data, will make more data open for business.
25 May 2014
Is there a future for careers in Information Technology? Globally, professional societies such as the British Computer Society and the Australian Computer Society have long argued that practitioners need to be professionals. However, there is a counter-argument that technology is an enabler for all professions and is more generally a capability of many rather than a profession of the few.
At the same time, many parents, secondary school teachers and even tertiary educators have warned students that a Technology career is highly risky with many traditional roles being moved to lower cost countries such as India, China and The Philippines. Newspaper headlines in recent years about controversy over the use of imported workers in local Technology roles have only served to further unsettle potential graduates.
Technologists as agents of change
Organisations increasingly realise that if they don’t encourage those who have information and insight about the future of technology in their business, they will be creating a lumbering hierarchy that is incapable of change.
How should companies seek out those innovations that will enable the future business models that haven’t been invented yet? Will current technology savings cause “pollution” that will saddle future business initiatives with impossible complexity? Is the current portfolio of projects simply keeping the lights on or is it preparing for real change? Does the organisation have a group of professionals driving change in their business in the years to come or do they have a group of technicians who are responding without understanding why?
These questions deeply trouble many businesses and are leading to a greater focus on having a group of dedicated technology professionals at every level of the organisation and often dispersed through the lines of business.
The recognition of the need for these change agents should answer the question on the future of the profession. At a time when business needs innovation which can only be achieved through technology, society is increasingly worried about a future where their every interaction might be tracked.
While the Information Technology profession has long talked about the ethics of information and privacy, it is only recently that society is starting to care. With the publicity around the activities of government and big business starting to cause wide concern, it is likely that the next decade will see a push towards greater ownership of data by the customer, more sophisticated privacy and what is being dubbed “forget me” legislation where companies need to demonstrate they can completely purge all record of an individual.
While every business will have access to advice at senior levels, it is those who embed Information Technology professionals at every level through their organisation that will have the ability to think ahead to the consequences of each decision.
A professional’s perspective
These decisions often form branches in the road. While requirements can often be met in different, but apparently similar paths, the difference between the fastest route and the slowest is sometimes measured in orders of magnitude. Sometimes these decisions turn out to be the difference between success and failure. A seemingly innocuous choice to pick a particular building block, external interface or language can either be lauded or regretted many years later.
Ours is a career that has invited many to join from outside, and the possibilities that the digital and information economy creates have enticed many who have tinkered to make Information Technology their core focus. While this is a good thing, it is critical that those relying on technology skills can have confidence in the decisions that are being made both now and in the future.
Practitioners who have developed their knowledge in an ad-hoc way, without the benefit of testing their wider coverage of the discipline, are at risk of making decisions that meet immediate requirements but which cut off options for the future or leave the organisation open to structural issues which only become apparent in decades to come. In short, these people are often good builders but poor architects.
But is there a future at all?
Casual observers of the industry can be forgiven for thinking that the constant change in technology means that skills of future practitioners will be so different to those of today as to make any professional training irrelevant. Anyone who holds this view would be well served by reading relevant Technology articles from previous eras such as the 1980s when there was a popular perception that so-called “fourth generation languages” would mean the end of computer programming.
While the technical languages of choice today are different to those of the 1970s, 80s and subsequent decades, the fundamental skills are the same. What’s more, anyone who has developed professional (as opposed to purely technical) skills as a developer using any language can rapidly transition to any new language as it becomes popular. True Technology professionals are savvy to the future using the past as their guide and make good architecture their goal.
The way forward
Certainly the teaching and foundations of Technology need to change. There has been much too much focus on current technical skills. The successful Technologist has a feel for the trends based on history and is able to pick up any specific skill as needed through their career.
Senior executives, regardless of their role, express frustration about the cost and complexity of doing even seemingly simple things such as preparing a marketing campaign, adding a self-service capability or combining two services into one. No matter which way you look at it, it costs more to add or change even simple things in organisations due to the increasing complexity that a generation of projects have left behind as their legacy (see Value of decommissioning legacy systems).
It should come as no surprise that innovation seems to come from Greenfield start-ups, many of which have been funded by established companies whose own legacy stymies experimentation and agility.
This need to start again is neither productive nor sustainable. Once a business accepts the assertion that complexity caused by the legacy of previous projects is the enemy of agility, then they need to ask whether their Technology capabilities are adding to the complexity while solving immediate problems or if they are encouraging Technology professionals to create solutions that not only meet a need but also simplify the enterprise in preparation for an unknown tomorrow.
24 May 2014
Why estimating Data Quality profiling doesn’t have to be guess-work
Data Management lore would have us believe that estimating the amount of work involved in Data Quality analysis is a bit of a “Dark Art,” and to get a close enough approximation for quoting purposes requires much scrying, haruspicy and wet-finger-waving, as well as plenty of general wailing and gnashing of teeth. (Those of you with a background in Project Management could probably argue that any type of work estimation is just as problematic, and that in any event work will expand to more than fill the time available…).
However, you may no longer need to call on the services of Severus Snape or Mystic Meg to get a workable estimate for data quality profiling. My colleague from QFire Software, Neil Currie, recently put me onto a post by David Loshin on SearchDataManagement.com, which proposes a more structured and rational approach to estimating data quality work effort.
At first glance, the overall methodology that David proposes is reasonable in terms of estimating effort for a pure profiling exercise – at least in principle. (It’s analogous to similar “bottom-up” calculations that I’ve used in the past to estimate ETL development on a job-by-job basis, or creation of standard Business Intelligence reports on a report-by-report basis).
I would observe that David’s approach is predicated on the (big and probably optimistic) assumption that we’re only doing the profiling step. The follow-on stages of analysis, remediation and prevention are excluded – and in my experience, that’s where the real work most often lies! There is also the assumption that a pre-existing checklist of assessment criteria exists – and developing the library of quality check criteria can be a significant exercise in its own right.
However, even accepting the “profiling only” principle, I’d also offer a couple of additional enhancements to the overall approach.
Firstly, even with profiling tools, the inspection and analysis process for any “wrong” elements can go a lot further than just a 10-minute-per-item-compare-with-the-checklist, particularly in data sets with a large number of records. Also, there’s the question of root-cause diagnosis (And good DQ methods WILL go into inspecting the actual member records themselves). So for contra-indicated attributes, I’d suggest a slightly extended estimation model:
* 10mins: for each “Simple” item (standard format, no applied business rules, fewer than 100 member records)
Secondly, and more importantly – David doesn’t really allow for the human factor. It’s always people that are bloody hard work! While it’s all very well to do a profiling exercise in-and-of-itself, the results need to be shared with human beings – presented, scrutinised, questioned, validated, evaluated, verified, justified. (Then acted upon, hopefully!) And even allowing for the set-aside of the “Analysis” stages onwards, there will still need to be some form of socialisation within the “Profiling” phase.
That’s not a technical exercise – it’s about communication, collaboration and co-operation. Which means it may take an awful lot longer than just doing the tool-based profiling process!
How much socialisation? That depends on the number of stakeholders, and their nature. As a rule-of-thumb, I’d suggest the following:
* Two hours of preparation per workshop (If the stakeholder group is “tame”. Double it if there are participants who are negatively inclined).
That’s in addition to David’s formula for estimating the pure data profiling tasks.
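As a rough illustration, the rule-of-thumb figures quoted above can be turned into a small calculation. This is a minimal sketch, not David Loshin's formula and not a complete model: it uses only the two figures that appear in this post (10 minutes per "Simple" item, and two hours of preparation per workshop, doubled for a negatively inclined group). The function names and the `other_hours` parameter are hypothetical, added so that any further allowances can be supplied by the estimator rather than guessed here.

```python
# Figures taken from the post; everything else is an illustrative assumption.
SIMPLE_ITEM_MINUTES = 10   # 10 mins for each "Simple" item
WORKSHOP_PREP_HOURS = 2    # two hours of preparation per workshop
HOSTILE_MULTIPLIER = 2     # "double it" if participants are negatively inclined


def simple_item_hours(simple_items: int) -> float:
    """Inspection effort, in hours, for attributes classed as "Simple"."""
    return simple_items * SIMPLE_ITEM_MINUTES / 60


def workshop_prep_hours(workshops: int, hostile: bool = False) -> float:
    """Preparation effort, in hours, for the socialisation workshops."""
    hours = float(workshops * WORKSHOP_PREP_HOURS)
    return hours * HOSTILE_MULTIPLIER if hostile else hours


def extra_effort_hours(simple_items: int, workshops: int,
                       hostile: bool = False, other_hours: float = 0.0) -> float:
    """Effort to add on top of the pure profiling estimate.

    other_hours is a hypothetical catch-all for allowances not covered
    by the two figures above (for example, more complex items).
    """
    return (simple_item_hours(simple_items)
            + workshop_prep_hours(workshops, hostile)
            + other_hours)


if __name__ == "__main__":
    # 30 simple items and 3 workshops with a difficult stakeholder group.
    print(extra_effort_hours(simple_items=30, workshops=3, hostile=True))
```

Keeping the item and workshop components as separate functions makes it easy to see which part of the estimate dominates; in line with the argument above, the people-facing component can quickly outweigh the tool-based one.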
Detailed root-cause analysis (Validate), remediation (Protect) and ongoing evaluation (Monitor) stages are a whole other ball-game.
Alternatively, just stick with the crystal balls and goats – you might not even need to kill the goat anymore…
24 May 2014
Missed what’s been happening in the information management community? Check out our community update:
22 May 2014
I have written about several aspects of the Internet of Things (IoT) in previous posts on this blog (The Internet of Humans, The Quality of Things, and Is it finally the Year of IoT?). Two recent articles have examined a few of the things that are interrupting the progress of IoT.
In his InformationWeek article Internet of Things: What’s Holding Us Back, Chris Murphy reported that while “companies in a variety of industries — transportation, energy, heavy equipment, consumer goods, healthcare, hospitality, insurance — are getting measurable results by analyzing data collected from all manner of machines, equipment, devices, appliances, and other networked things” (of which his article notes many examples), there are things slowing the progress of IoT.
One of its myths, Murphy explained, “is that companies have all the data they need, but their real challenge is making sense of it. In reality, the cost of collecting some kinds of data remains too high, the quality of the data isn’t always good enough, and it remains difficult to integrate multiple data sources.”
The fact is a lot of the things we want to connect to IoT weren’t built for Internet connectivity, so it’s not as simple as just sticking a sensor on everything. Poor data quality, especially timeliness, but also completeness and accuracy, is impacting the usefulness of the data transmitted by things currently connected to sensors. Other issues Murphy explores in his article include the lack of wireless network ubiquity, data integration complexities, and securing IoT data centers and devices against Internet threats.
“The clearest IoT successes today,” Murphy concluded, “are from industrial projects that save companies money, rather than from projects that drive new revenue. But even with these industrial projects, companies shouldn’t underestimate the cultural change they need to manage as machines start telling veteran machine operators, train drivers, nurses, and mechanics what’s wrong, and what they should do, in their environment.”
In his Wired article Why Tech’s Best Minds Are Very Worried About the Internet of Things, Klint Finley reported on the findings of a survey about IoT from the Pew Research Center. While potential benefits of IoT were noted, such as medical devices monitoring health and environmental sensors detecting pollution, potential risks were highlighted, security being one of the most immediate concerns. “Beyond security concerns,” Finley explained, “there’s the threat of building a world that may be too complex for our own good. If you think error messages and applications crashes are a problem now, just wait until the web is embedded in everything from your car to your sneakers. Like the VCR that forever blinks 12:00, many of the technologies built into the devices of the future may never be used properly.”
The research also noted concerns about IoT expanding the digital divide, ostracizing those who cannot, or choose not to, participate. Data privacy concerns were also raised, including the possibility of dehumanizing the workplace by monitoring employees through wearable devices (e.g., your employee badge being used to track your movements).
Some survey respondents hinted at a division similar to what we hear about the cloud, differentiating a public IoT from a private IoT. In the latter case, instead of sacrificing our privacy for the advantages of connected devices, there’s no reason our devices can’t connect to a private network instead of the public internet. Other survey respondents believe that IoT has been overhyped as the next big thing and while it will be useful in some areas (e.g., the military, hospitals, and prisons) it will not pervade mainstream culture and therefore will not invade our privacy as many fear.
What Say You about IoT?
Please share your viewpoint about IoT by leaving a comment below.
16 May 2014
A “foreign” colleague of mine once told me a trick his English language teacher taught him to help him remember the “questioning words” in English. (To the British, anyone who is a non-native speaker of English is “foreign.” I should also add that as a Scotsman, English is effectively my second language…).
“Five Whiskies in a Hotel” is the clue – i.e. five questioning words begin with “W” (Who, What, When, Why, Where), with one beginning with “H” (How).
These simple question words give us a great entry point when we are trying to capture the initial set of issues and concerns around data governance – what questions are important/need to be asked.
* What data/information do you want? (What inputs? What outputs? What tests/measures/criteria will be applied to confirm whether the data is fit for purpose or not?)
Clearly, each question can generate multiple answers!
Aside: in the Doric dialect of North-East of Scotland where I originally hail from, all the “question” words begin with “F”:
Whatever your native language, these key questions should get the conversation started…
Remember too, the homily by Rudyard Kipling:
13 May 2014
Let’s be honest here. Data Quality is good and worthy, but it can be a pretty dull affair at times. Information Management is something that “just happens”, and folks would rather not know the ins-and-outs of how the monthly Management Pack gets created.
Yet I’ll bet that they’ll be right on your case when the numbers are “wrong”.
So here’s an idea. The next time you want to engage someone in a discussion about data quality, don’t start by discussing data quality. Don’t mention the processes of profiling, validating or cleansing data. Don’t talk about integration, storage or reporting. And don’t even think about metadata, lineage or auditability. Yaaaaaaaaawn!!!!
Instead of concentrating on telling people about the practitioner processes (which of course are vital, and fascinating no doubt if you happen to be a practitioner), think about engaging in a manner that is relevant to the business community, using language and examples that are business-oriented. Make it fun!
Once you’ve got the discussion flowing in terms of the impacts, challenges and inhibitors that get in the way of successful business operations, then you can start to drill into the underlying data issues and their root causes. More often than not, a data quality issue is symptomatic of a business process failure rather than being an end in itself. By fixing the process problem, the business user gains a benefit, and the data is enhanced as a by-product. Everyone wins (and you didn’t even have to mention the dreaded DQ phrase!)
Data Quality is a human thing – that’s why it’s hard. As practitioners, we need to be communicators. Lead the thinking, identify the impact and deliver the value.
Now, that’s interesting!
12 May 2014
Just recently, Gary Allemann posted a guest article on Nicola Askham’s Blog, which made an analogy between Data Governance and the London Tube map. (Nicola is also on Twitter. See also Gary Allemann’s blog, Data Quality Matters.)
Up until now, I’ve always struggled to think of a way to represent all of the different aspects of Information Management/Data Governance; the environment is multi-faceted, with the interconnections between the component capabilities being complex and not hierarchical. I’ve sometimes alluded to there being a network of relationships between elements, but this has been a fairly abstract concept that I’ve never been able to adequately illustrate.
And in a moment of perspiration, I came up with this…
I’ll be developing this further as I go but in the meantime, please let me know what you think.
(NOTE: following on from Seth Godin’s plea for more sharing of ideas, I am publishing the Information Management Tube Map under Creative Commons License Attribution Share-Alike V4.0 International. Please credit me where you use the concept, and I would appreciate it if you could reference back to me with any changes, suggestions or feedback. Thanks in advance.)
10 May 2014
Missed what’s been happening in the data management community? Check out our bi-weekly update: