Archive for May, 2014
In his book Open Data Now: The Secret to Hot Startups, Smart Investing, Savvy Marketing, and Fast Innovation, Joel Gurin explained that Open Data and Big Data are related but very different.
While various definitions exist, Gurin noted that “all definitions of Open Data include two basic features: the data must be publicly available for anyone to use, and it must be licensed in a way that allows for its reuse. Open Data should also be in a form that makes it relatively easy to use and analyze, although there are gradations of openness. And there’s general agreement that Open Data should be free of charge or cost just a minimal amount.”
“Big Data involves processing very large datasets to identify patterns and connections in the data,” Gurin explained. “It’s made possible by the incredible amount of data that is generated, accumulated, and analyzed every day with the help of ever-increasing computer power and ever-cheaper data storage. It uses the data exhaust that all of us leave behind through our daily lives. Our mobile phones’ GPS systems report back on our location as we drive; credit card purchase records show what we buy and where; Google searches are tracked; smart meters in our homes record our energy usage. All are grist for the Big Data mill.”
Private and Passive versus Public and Purposeful
Gurin explained that Big Data tends to be private and passive, whereas Open Data tends to be public and purposeful.
“Big Data usually comes from sources that passively generate data without purpose, without direction, or without even realizing that they’re creating it. And the companies and organizations that use Big Data usually keep the data private for business or security reasons. This includes the data that large retailers hold on customers’ buying habits, that hospitals hold about their patients, and that banks hold about their credit card holders.”
By contrast, Open Data “is consciously released in a way that anyone can access, analyze, and use as he or she sees fit. Open Data is also often released with a specific purpose in mind—whether the goal is to spur research and development, fuel new businesses, improve public health and safety, or achieve any number of other objectives.”
“While Big Data and Open Data each have important commercial uses, they are very different in philosophy, goals, and practice. For example, large companies may use Big Data to analyze customer databases and target their marketing to individual customers, while they use Open Data for market intelligence and brand building.”
Big and Open Data
Gurin also noted, however, that some of the most powerful results arise when Big Data and Open Data overlap.
“Some government agencies have made very large amounts of data open with major economic benefits. National weather data and GPS data are the most often-cited examples. U.S. census data and data collected by the Securities and Exchange Commission and the Department of Health and Human Services are others. And nongovernmental research has produced large amounts of data, particularly in biomedicine, that is now being shared openly to accelerate the pace of scientific discovery.”
Data Open for Business
Gurin addressed the apparent paradox of Open Data: “If Open Data is free, how can anyone build a business on it? The answer is that Open Data is the starting point, not the endpoint, in deriving value from information.” For example, even though weather and GPS data have been available for decades, those same Open Data starting points continue to spark new ideas, generating new, and profitable, endpoints.
While data privacy still requires sensitive data not be shared without consent and competitive differentiation still requires an organization’s intellectual property not be shared, that still leaves a vast amount of other data which, if made available as Open Data, will make more data open for business.
Is there a future for careers in Information Technology? Globally, professional societies such as the British Computer Society and the Australian Computer Society have long argued that practitioners need to be professionals. However, there is a counter-argument that technology is an enabler for all professions and is more generally a capability of many rather than a profession of the few.
At the same time, many parents, secondary school teachers and even tertiary educators have warned students that a Technology career is highly risky with many traditional roles being moved to lower cost countries such as India, China and The Philippines. Seeing headlines in the newspapers in recent years headlining controversy over the use imported works in local Technology roles has only served to further unsettle potential graduates.
Technologists as agents of change
Organisations increasingly realise that if they don’t encourage those who have information and insight about the future of technology in their business, they be creating a lumbering hierarchy that is incapable of change.
How should companies seek out those innovations that will enable the future business models that haven’t been invented yet? Will current technology savings cause “pollution” that will saddle future business initiatives with impossible complexity? Is the current portfolio of projects simply keeping the lights on or is it preparing for real change? Does the organisation have a group of professionals driving change in their business in the years to come or do they have a group of technicians who are responding without understanding why?
These questions deeply trouble many businesses and are leading to a greater focus on having a group of dedicated technology professionals at every level of the organisation and often dispersed through the lines of business.
The recognition of the need for these change agents should answer the question on the future of the profession. At a time when business needs innovation which can only achieved through technology, society is increasingly worried about a future where their every interaction might be tracked.
While the Information Technology profession has long talked about the ethics of information and privacy, it is only recently that society is starting to care. With the publicity around the activities of government and big business starting to cause wide concern, it is likely that the next decade will see a push towards greater ownership of data by the customer, more sophisticated privacy and what is being dubbed “forget me” legislation where companies need to demonstrate they can completely purge all record of an individual.
While every business will have access to advice at senior levels, it is those who embed Information Technology professionals at every level through their organisation that will have the ability to think ahead to the consequences of each decision.
A professional’s perspective
These decisions often form branches in the road. While requirements can often be met in different, but apparently similar paths, the difference between the fastest route and the slowest is sometimes measured in orders of magnitude. Sometimes these decisions turn out to be difference between success and failure. A seemingly innocuous choice to pick a particular building block, external interface or language can either be lauded or regretted many years later.
Ours is a career that has invited many to join from outside and the possibilities that the digital and information economy create had enticed many who have tinkered to make Information Technology their core focus. While this is a good thing, it is critical that those relying on technology skills can have confidence in the decisions that are being made both now and in the future.
Practitioners who have developed their knowledge in an ad-hoc way, without the benefit of testing their wider coverage of the discipline, are at risk of making decisions that meet immediate requirements but which cut-off options for the future or leave the organisation open to structural issues which only become apparent in decades to come. In short, these people are often good builders but poor architects.
But is there a future at all?
Casual observers of the industry can be forgiven for thinking that the constant change in technology means that skills of future practitioners will be so different to those of today as to make any professional training irrelevant. Anyone who holds this view would be well served by reading relevant Technology articles from previous eras such as the 1980s when there was a popular perception that so-called “fourth generation languages” would mean the end of computer programming.
While the technical languages of choice today are different to those of the 1970s, 80s and subsequent decades, the fundamental skills are the same. What’s more, anyone who has developed professional (as opposed to purely technical) skills as a developer using any language can rapidly transition to any new language as it becomes popular. True Technology professionals are savvy to the future using the past as their guide and make good architecture their goal.
The way forward
Certainly the teaching and foundations of Technology need to change. There has been much too much focus on current technical skills. The successful Technologist has a feel for the trends based on history and is able to pick-up any specific skill as needed through their career.
Senior executives, regardless of their role, express frustration about the cost and complexity of doing even seemingly simple things such as preparing a marketing campaign, adding a self-service capability or combining two services into one. No matter which way you look at it, it costs more to add or change even simple things in organisations due to the increasing complexity that a generation of projects have left behind as their legacy (see Value of decommissioning legacy systems).
It should come as no surprise that innovation seems to come from Greenfield start-ups, many of which have been funded by established companies whose own legacy stymies experimentation and agility.
This need to start again is neither productive nor sustainable. Once a business accepts the assertion that complexity caused by the legacy of previous projects is the enemy of agility, then they need to ask whether their Technology capabilities are adding to the complexity while solving immediate problems or if they are encouraging Technology professionals to create solutions that not only meet a need but also simplify the enterprise in preparation for an unknown tomorrow.
Why estimating Data Quality profiling doesn’t have to be guess-work
Data Management lore would have us believe that estimating the amount of work involved in Data Quality analysis is a bit of a “Dark Art,” and to get a close enough approximation for quoting purposes requires much scrying, haruspicy and wet-finger-waving, as well as plenty of general wailing and gnashing of teeth. (Those of you with a background in Project Management could probably argue that any type of work estimation is just as problematic, and that in any event work will expand to more than fill the time available…).
However, you may no longer need to call on the services of Severus Snape or Mystic Meg to get a workable estimate for data quality profiling. My colleague from QFire Software, Neil Currie, recently put me onto a post by David Loshin on SearchDataManagement.com, which proposes a more structured and rational approach to estimating data quality work effort.
At first glance, the overall methodology that David proposes is reasonable in terms of estimating effort for a pure profiling exercise – at least in principle. (It’s analogous to similar “bottom/up” calculations that I’ve used in the past to estimate ETL development on a job-by-job basis, or creation of standards Business Intelligence reports on a report-by-report basis).
I would observe that David’s approach is predicated on the (big and probably optimistic) assumption that we’re only doing the profiling step. The follow-on stages of analysis, remediation and prevention are excluded – and in my experience, that’s where the real work most often lies! There is also the assumption that a pre-existing checklist of assessment criteria exists – and developing the library of quality check criteria can be a significant exercise in its own right.
However, even accepting the “profiling only” principle, I’d also offer a couple of additional enhancements to the overall approach.
Firstly, even with profiling tools, the inspection and analysis process for any “wrong” elements can go a lot further than just a 10-minute-per-item-compare-with-the-checklist, particularly in data sets with a large number of records. Also, there’s the question of root-cause diagnosis (And good DQ methods WILL go into inspecting the actual member records themselves). So for contra-indicated attributes, I’d suggest a slightly extended estimation model:
* 10mins: for each “Simple” item (standard format, no applied business rules, fewer that 100 member records)
* 30 mins: for each “Medium” complexity item (unusual formats, some embedded business logic, data sets up to 1000 member records)
* 60 mins: for any “Hard” high-complexity items (significant, complex business logic, data sets over 1000 member records)
Secondly, and more importantly – David doesn’t really allow for the human factor. It’s always people that are bloody hard work! While it’s all very well to do a profiling exercise in-and-of-itself, the result need to be shared with human beings – presented, scrutinised, questioned, validated, evaluated, verified, justified. (Then acted upon, hopefully!) And even allowing for the set-aside of the “Analysis” stages onwards, then there will need to be some form of socialisation within the “Profiling” phase.
That’s not a technical exercise – it’s about communication, collaboration and co-operation. Which means it may take an awful lot longer than just doing the tool-based profiling process!
How much socialisation? That depends on the number of stakeholders, and their nature. As a rule-of-thumb, I’d suggest the following:
* Two hours of preparation per workshop ((If the stakeholder group is “tame”. Double it if there are participants who are negatively inclined).
* One hour face-time per workshop (Double it for “negatives”)
* One hour post-workshop write-up time per workshop
* One workshop per 10 stakeholders.
* Two days to prepare any final papers and recommendations, and present to the Steering Group/Project Board.
That’s in addition to David’s formula for estimating the pure data profiling tasks.
Detailed root-cause analysis (Validate), remediation (Protect) and ongoing evaluation (Monitor) stages are a whole other ball-game.
Alternatively, just stick with the crystal balls and goats – you might not even need to kill the goat anymore…
Missed what’s been happening in the information management community? Check out our community update:
Big Data Solution Offering
Have you seen the latest offering in our Composite Solutions suite?
The Big Data Solution Offering provides an approach for storing, managing and accessing data of very high volumes, variety or complexity.
Storing large volumes of data from a large variety of data sources in traditional relational data stores is cost-prohibitive, and regular data modeling approaches and statistical tools cannot handle data structures with such high complexity. This solution offering discusses new types of data management systems based onNoSQL database management systems and MapReduce as the typical programming model and access method.
Read our Executive Summary for an overview of this solution.
We hope you find this new offering of benefit and welcome any suggestions you may have to improve it.
This Week’s Blogs for Thought:
Key Questions to Get Started with Data Governance
“Five Whiskies in a Hotel” is the clue – i.e. five questioning words begin with “W” (Who, What, When, Why, Where), with one beginning with “H” (How). These simple question words give us a great entry point when we are trying to capture the initial set of issues and concerns around data governance – what questions are important/need to be asked.
Things Interrupting the Internet of Things
I have written about several aspects of the Internet of Things (IoT) in previous posts on this blog (The Internet of Humans, The Quality of Things, and Is it finally the Year of IoT?). Two recent articles have examined a few of the things that are interrupting the progress of IoT. In his InformationWeek article Internet of Things: What’s Holding Us Back, Chris Murphy reported that while “companies in a variety of industries — transportation, energy, heavy equipment, consumer goods, healthcare, hospitality, insurance — are getting measurable results by analyzing data collected from all manner of machines, equipment, devices, appliances, and other networked things” (of which his article notes many examples), there are things slowing the progress of IoT.
Is Your Data Quality Boring?
Let’s be honest here. Data Quality is good and worthy, but it can be a pretty dull affair at times. Information Management is something that “just happens”, and folks would rather not know the ins-and-outs of how the monthly Management Pack gets created. Yet I’ll bet that they’ll be right on your case when the numbers are “wrong.” Right?
So here’s an idea. The next time you want to engage someone in a discussion about data quality, don’t start by discussing data quality. Don’t mention the processes of profiling, validating or cleansing data. Don’t talk about integration, storage or reporting. And don’t even think about metadata, lineage or auditability.
Forward this message to a friendQuestions?
If you have any questions, please email us at email@example.com.
I have written about several aspects of the Internet of Things (IoT) in previous posts on this blog (The Internet of Humans, The Quality of Things, and Is it finally the Year of IoT?). Two recent articles have examined a few of the things that are interrupting the progress of IoT.
In his InformationWeek article Internet of Things: What’s Holding Us Back, Chris Murphy reported that while “companies in a variety of industries — transportation, energy, heavy equipment, consumer goods, healthcare, hospitality, insurance — are getting measurable results by analyzing data collected from all manner of machines, equipment, devices, appliances, and other networked things” (of which his article notes many examples), there are things slowing the progress of IoT.
One of its myths, Murphy explained, “is that companies have all the data they need, but their real challenge is making sense of it. In reality, the cost of collecting some kinds of data remains too high, the quality of the data isn’t always good enough, and it remains difficult to integrate multiple data sources.”
The fact is a lot of the things we want to connect to IoT weren’t built for Internet connectivity, so it’s not as simple as just sticking a sensor on everything. Poor data quality, especially timeliness, but also completeness and accuracy, is impacting the usefulness of the data transmitted by things currently connected to sensors. Other issues Murphy explores in his article include the lack of wireless network ubiquity, data integration complexities, and securing IoT data centers and devices against Internet threats.
“The clearest IoT successes today,” Murphy concluded, “are from industrial projects that save companies money, rather than from projects that drive new revenue. But even with these industrial projects, companies shouldn’t underestimate the cultural change they need to manage as machines start telling veteran machine operators, train drivers, nurses, and mechanics what’s wrong, and what they should do, in their environment.”
In his Wired article Why Tech’s Best Minds Are Very Worried About the Internet of Things, Klint Finley reported on the findings of a survey about IoT from the Pew Research Center. While potential benefits of IoT were noted, such as medical devices monitoring health and environmental sensors detecting pollution, potential risks were highlighted, security being one of the most immediate concerns. “Beyond security concerns,” Finley explained, “there’s the threat of building a world that may be too complex for our own good. If you think error messages and applications crashes are a problem now, just wait until the web is embedded in everything from your car to your sneakers. Like the VCR that forever blinks 12:00, many of the technologies built into the devices of the future may never be used properly.”
The research also noted concerns about IoT expanding the digital divide, ostracizing those who can not, or choose not to, participate. Data privacy concerns were also raised, including the possibility of dehumanizing the workplace by monitoring employees through wearable devices (e.g., your employee badge being used to track your movements).
Some survey respondents hinted at a division similar to what we hear about the cloud, differentiating a public IoT from a private IoT. In the latter case, instead of sacrificing our privacy for the advantages of connected devices, there’s no reason our devices can’t connect to a private network instead of the public internet. Other survey respondents believe that IoT has been overhyped as the next big thing and while it will be useful in some areas (e.g., the military, hospitals, and prisons) it will not pervade mainstream culture and therefore will not invade our privacy as many fear.
What Say You about IoT?
Please share your viewpoint about IoT by leaving a comment below.
A “foreign” colleague of mine once told me a trick his English language teacher taught him to help him remember the “questioning words” in English. (To the British, anyone who is a non-native speaker of English is “foreign.” I should also add that as a Scotsman, English is effectively my second language…).
“Five Whiskies in a Hotel” is the clue – i.e. five questioning words begin with “W” (Who, What, When, Why, Where), with one beginning with “H” (How).
These simple question words give us a great entry point when we are trying to capture the initial set of issues and concerns around data governance – what questions are important/need to be asked.
* What data/information do you want? (What inputs? What outputs? What tests/measures/criteria will be applied to confirm whether the data is fit for purpose or not?)
* Why do you want it? (What outcomes do you hope to achieve? Does the data being requested actually support those questions & outcome? Consider Efficiency/Effectiveness/Risk Mitigation drivers for benefit.)
* When is the information required? (When is it first required? How frequently? Particular events?)
* Who is involved? (Who is the information for? Who has rights to see the data? Who is it being provided by? Who is ultimately accountable for the data – both contents and definitions? Consider multiple stakeholder groups in both recipients and providers)
* Where is the data to reside? (Where is it originating form? Where is it going to?)
* How will it be shared? (How will the mechanisms/methods work to collect/collate/integrate/store/disseminate/access/archive the data? How should it be structured & formatted? Consider Systems, Processes and Human methods.)
Clearly, each question can generate multiple answers!
Aside: in the Doric dialect of North-East of Scotland where I originally hail from, all the “question” words begin with “F”:
Fit…? (What?) e.g. “Fit dis yon feel loon wint?” (What does that silly chap want?)
Fit wye…? (Why?) e.g. “Fit wye div ye wint a’thin’?” (Why do you want everything?)
Fan…? (When?) e.g. “Fan div ye wint it?” (When you you want it?)
Fa…? (Who?) e.g. “Fa div I gie ‘is tae?” (Who do I give this to?)
Far…? (Where?) e.g. “Far aboots dis yon thingumyjig ging?” (Where exactly does that item go?)
Foo…? (How?) e.g. “Foo div ye expect me tae dae it by ‘e morn?” (How do you expect me to do it by tomorrow?)
Whatever your native language, these key questions should get the conversation started…
Remember too, the homily by Rudyard Kipling:
Is this the kind of response you get when you mention to people that you work in Data Quality?!
Let’s be honest here. Data Quality is good and worthy, but it can be a pretty dull affair at times. Information Management is something that “just happens”, and folks would rather not know the ins-and-outs of how the monthly Management Pack gets created.
Yet I’ll bet that they’ll be right on your case when the numbers are “wrong”.
So here’s an idea. The next time you want to engage someone in a discussion about data quality, don’t start by discussing data quality. Don’t mention the processes of profiling, validating or cleansing data. Don’t talk about integration, storage or reporting. And don’t even think about metadata, lineage or auditability. Yaaaaaaaaawn!!!!
Instead of concentrating on telling people about the practitioner processes (which of course are vital, and fascinating no doubt if you happen to be a practitioner), think about engaging in a manner that is relevant to the business community, using language and examples that are business-oriented. Make it fun!
Once you’ve got the discussion flowing in terms of the impacts, challenges and inhibitors that get in the way of successful business operations, then you can start to drill into the underlying data issues and their root causes. More often than not, a data quality issue is symptomatic of a business process failure rather than being an end in itself. By fixing the process problem, the business user gains a benefit, and the data in enhanced as a by-product. Everyone wins (and you didn’t even have to mention the dreaded DQ phrase!)
Data Quality is a human thing – that’s why its hard. As practitioners, we need to be communicators. Lead the thinking, identify the impact and deliver the value.
Now, that’s interesting!
Just recently, Gary Allemann posted a guest article on Nicola Askham’s Blog, which made an analogy between Data Governance and the London Tube map. (Nicola also on Twitter. See also Gary Allemann’s blog, Data Quality Matters.)
Up until now, I’ve always struggled to think of a way to represent all of the different aspects of Information Management/Data Governance; the environment is multi-faceted, with the interconnections between the component capabilities being complex and not hierarchical. I’ve sometimes alluded to there being a network of relationship between elements, but this has been a fairly abstract concept that I’ve never been able to adequately illustrate.
And in a moment of perspiration, I came up with this…
I’ll be developing this further as I go but in the meantime, please let me know what you think.
(NOTE: following on from Seth Godin’ plea for more sharing of ideas, I am publishing the Information Management Tube Map under Creative Commons License Attribution Share-Alike V4.0 International. Please credit me where you use the concept, and I would appreciate it if you could reference back to me with any changes, suggestions or feedback. Thanks in advance.)
Missed what’s been happening in the data management community? Check out our bi-weekly update:
Did You Know?
MIKE’s Integrated Content Repository brings together the open assets from the MIKE2.0 Methodology, shared assets available on the internet and internally held assets. The Integrated Content Repository is a virtual hub of assets that can be used by an Information Management community, some of which are publicly available and some of which are held internally.
Any organisation can follow the same approach and integrate their internally held assets to the open standard provided by MIKE2.0 in order to:
- Build community
- Create a common standard for Information Development
- Share leading intellectual property
- Promote a comprehensive and compelling set of offerings
- Collaborate with the business units to integrate messaging and coordinate sales activities
- Reduce costs through reuse and improve quality through known assets
The Integrated Content Repository is a true Enterprise 2.0 solution: it makes use of the collaborative, user-driven content built using Web 2.0 techniques and technologies on the MIKE2.0 site and incorporates it internally into the enterprise. The approach followed to build this repository is referred to as a mashup.
Feel free to try it out when you have a moment- we’re always open to new content ideas.
This Week’s Blogs for Thought:
An Open Source Solution for Better Performance Management
Today, many organizations are facing increased scrutiny and a higher level of overall performance expectation from internal and external stakeholders. Both business and public sector leaders must provide greater external and internal transparancy to their activities, ensure accounting data faces up to compliance challenges, and extract the return and competitive advantage out of their customer, operational and performance information.
Finding Ugly Data with Pretty Pictures
While data visualization often produces pretty pictures, as Phil Simon explained in his new book The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions, “data visualization should not be confused with art. Clarity, utility, and user-friendliness are paramount to any design aesthetic.” Bad data visualizations are even worse than bad art since, as Simon says, “they confuse people more than they convey information.”
Is the cloud really safe?
With the rise of Apple’s iCloud and file-sharing solutions such as Dropbox and Sharepoint, it’s no surprise that SaaS and cloud computing solutions are increasing in popularity. From reduced costs in hosting and infrastructure to increased mobility, backups and accessibility, the benefits are profound. Need proof? A recent study by IDC estimates that spending on public IT cloud services will expand at a compound annual growth rate of 27.6%, from $21.5 billion in 2010 to $72.9 billion in 2015.
|Forward this message to a friend
Today, many organizations are facing increased scrutiny and a higher level of overall performance expectation from internal and external stakeholders. Both business and public sector leaders must provide greater external and internal transparancy to their activities, ensure accounting data faces up to compliance challenges, and extract the return and competitive advantage out of their customer, operational and performance information: Managers, investors and regulators have a new perspective on performance and compliance, which may include:
- No tolerance of any gap in financial controls
- Increased personal accountability for the accuracy of results
- Focus on fair representation and performance over accounting standards
- Increased need to understand the drivers of both financial and operational results
Senior executives across industry agree that achieving growth and performance objectives requires improved analytics, defining and implementing a process to track, monitor and measure innovation and performance, and making available more timely, accurate information for decision-making. MIKE2.0′s open source Enterprise Performance Management offering can help organizations achieve these critical objectives.
Learn more about this solution.
TODAY: Fri, March 24, 2017May2014