Archive for the ‘Information Management’ Category
Although I have been focusing on writing for the last three years, I still do my fair share of consulting. I find that my consulting experiences frequently fuel my writing, and it’s my hope that the lessons of my posts help people and organizations avoid many of the mistakes that I regrettably continue to see.
Here’s another lesson from the data management trenches.
One of my clients was changing some of its internal processes as part of implementing a new application, and the organization brought me in to do some data validation. In the course of my work, a few unexpected problems occurred because the organization's systems are tightly integrated. In other words, like squeezing a balloon, making a correct change in one part of an application unexpectedly caused changes in other areas. This set off what I'll term here an IM Chain, consisting of four parts:
- the problem
- the cause
- the fix
- the fallout
A Very Simple Model: The IM Chain
The IM Chain can be represented simply and visually: the problem leads to the cause, the cause to the fix, and the fix to the fallout.
Let’s look at the chain in some more depth. First, there’s the IM problem. Consider that:
- The problem may not be noticed at all.
- Some people may refuse to identify the problem–or the depth of the problem.
- The problem may be underestimated.
- The problem may be noticed too late.
- The problem may be noticed by people you’d ideally like to avoid (read: regulators, attorneys, and your competition).
Then, of course, there’s the cause:
- The cause is sometimes difficult to isolate, particularly if an end user is not knowledgeable of the other parts of the system–or other systems.
- Different entities within and outside of the organization often disagree on the cause of the problem.
- Some may refuse to identify the cause of the problem, even when faced with irrefutable proof.
Next up is the fix:
- The fix could be very expensive and time- and resource-intensive.
- The right resources might not be available to fix the problem.
- Different entities within the organization might disagree on the “right” fix.
- Legal or regulatory hurdles might compromise the proposed fix.
- The fix could come too late.
Finally, there’s the fallout:
- The fix could break other things. Can someone say Pandora’s Box?
- The “right” fix may not be politically acceptable inside an organization.
In almost all instances, integrated systems and applications are far superior to their standalone equivalents. Lamentably, far too many organizations continue to use and support multiple, disparate legacy applications for one reason or another. While I'm hardly a fan of multiple systems, records, and applications, there is, oddly, often one major benefit to these types of data silos: the IM Chain may cease to exist.
Understand that, if all of your systems talk to each other (as is increasingly common via open APIs), you’re going to have to deal with the IM Chain. Ask yourself if making one change is going to precipitate others. Don’t wait until it actually does.
What say you?
Fast Company recently ran a fantastic article on the successes and futures of Amazon, Apple, Facebook, and Google. These companies do so many things really well, not the least of which is managing their data astonishingly well. From the piece:
Data is like mother’s milk for [these companies]. Data not only fuels new and better advertising systems (which Google and Facebook depend on) but better insights into what you’d like to buy next (which Amazon and Apple want to know). Data also powers new inventions: Google’s voice-recognition system, its traffic maps, and its spell-checker are all based on large-scale, anonymous customer tracking. These three ideas feed one another in a continuous (and often virtuous) loop. Post-PC devices are intimately connected to individual users. Think of this: You have a family desktop computer, but you probably don’t have a family Kindle. E-books are tied to a single Amazon account and can be read by one person at a time.
In a word, wow.
Consider what Amazon, Apple, Facebook, and Google (aka the Gang of Four) do with their data in relation to the average large organization. By way of stark contrast, at a recent conference I attended, DataFlux CEO Tony Fisher described how most companies need a full two days to gather a list of their customers.
Think about that.
When I heard that statistic, I couldn’t help but wonder about the following questions:
- Is this list of customers ultimately accurate?
- Why does this take so long? Why can’t someone just run a report?
- How many organizations are trying to fix this–especially those that take two weeks or more?
- What about other types of lists (read: products, employees, vendors, etc.)?
- What kind of resources are involved in cobbling together these types of reports?
- How can an organization understand its customers’ motivations, preferences, and purchasing habits when, as is too often the case, even the definition of the term customer is in dispute?
- Most important, if the organization managed its data better and that data were more accurate, what else could it do with the time and resources currently spent just "keeping the lights on"?
Ah, good old opportunity cost. Think about what Amazon can do because it knows exactly who its customers are, which products they buy and when, and (increasingly) why they buy. Bezos and company can immediately pull accurate and comprehensive lists of who bought what and when, wasting no time or resources in the process.
Necessary and Sufficient
For good reason, the Gang of Four keeps its internal methods and systems pretty much under wraps. Even people who have written books about each company have had difficulty speaking with key internal players, as Richard Brandt (author of a forthcoming book on Amazon) recently told me.
However, this much I can write without fear of accurate contradiction: none of them achieved its level of success by poorly managing its data. Put differently, in the Age of the Platform, excellent data management is a necessary–but insufficient–condition for success.
This is not 1995; companies don’t buy even staple products such as Microsoft Windows or Office because no legitimate alternatives exist. “Have to” is increasingly being replaced with “want to.” You won’t know the difference between the two unless you know your customers.
What say you?
In an interview in Inc. Magazine, Steve Jobs once said, “You can’t just ask customers what they want and then try to give that to them. By the time you get it built, they’ll want something new.”
Jobs' wisdom and creative genius are the subject of many books–and doubtless many more to come. In this post, I'd like to discuss whether his statement above applies to the world of information management. In other words, can organizations respond to line of business (LOB) information management needs in time?
It’s a big question for a blog post, but I’m feeling ambitious today.
In the 1980s, the answer was generally yes. Business needs were fairly static (at least relative to today). Organizations by and large used the Waterfall method of deploying software, meticulously gathering requirements and implementing solutions in a very sequential manner. Protracted and often contentious projects ensued, but organizations could typically wait three years or more for a new application to go live. It was a different and much slower time.
We’re Not in Kansas Anymore
Fast forward to the 2010s and things could not be any more different. Ask most CIOs if three years is an acceptable amount of time to deploy a new system or technology. The answer is almost always no. Yes, major renovations and fundamental changes in architecture (read: embracing the cloud) still take time, but organizations simply can’t wait as long as they did in the 1980s because the world has changed so drastically. Things happen now at light speed.
As an aside, actor George Clooney recently remarked on Charlie Rose that movie reviews via tweets and Facebook updates are now available immediately, not on the following Monday. As a result, the grace period of a weekend or more previously afforded bad movies has been eliminated. If a movie bombs, the world will know by Friday night.
So, what's an organization to do amidst such rapid technological change? Business needs today may not match requirements gathered six months or a year ago, much less earlier than that.
Enter increasingly popular Agile methods. Rather than waiting for the big bang, features are implemented as they are ready, often while others are clearly not. Robert Stroud recently and succinctly defined the difference between Agile and Waterfall methodologies as follows: "Waterfall is requirements-bound and Agile is time-bound."
Lest you think that Agile methods are the sole purview of smaller, more nimble companies, larger organizations are getting on the Agile train. No doubt that the web makes this easier, as IT departments no longer have to install partially ready applications on thousands of workstations (a laborious process). Rather, features are instantly deployed via the web and users can access the latest bells and whistles seamlessly through their browsers.
So, let’s return to the initial query in this post: Can organizations respond to line of business (LOB) information management needs in time?
The simple answer is it depends. To be sure, Agile is no panacea. Bad data, misunderstood requirements, and cultural issues can derail even the most promising projects.
Still, the pros of Agile outweigh its cons. Organizations that cling to antiquated deployment methods will clearly not be able to meet LOB information management needs in any reasonable time frame. The web, open source, and the cloud all give organizations the ability to reduce product implementation times. It's a new world, and the Waterfall method is often poorly suited for it.
What say you?
I recently had the opportunity to visit four small businesses in a consulting capacity. Along with two other experts, I met some pretty dynamic little companies across the United States. These other experts provided consulting around different types of marketing, customer communications, and product positioning. When it was my turn to engage these small businesses' owners, I talked about website design, technology, and data.
No surprise here. That’s my bread and butter.
Here's another non-surprise: Many small business owners–and, by extension, employees at small businesses–do not think in terms of information management. That is, things like managing customer data are typically very informal processes done in a mostly disorganized manner.
Of course, there are some pretty big problems with this. Perhaps the biggest is scale. As these small businesses and their client bases begin to grow, their current manual tools prove far less useful than they had been. For instance, while it may be easy to remember which of your 50 prospective customers responded to an offer in your monthly newsletter, it's much harder to do the same with 500 or 5,000 prospects. Over the course of these consulting sessions, I suggested that these businesses strongly consider adopting–and actually utilizing–customer relationship management (CRM) applications.
Parallels for Large Enterprises
Remember that small businesses are the antitheses of larger enterprises: the former don't have nearly as many departments, resources, employees, and the like. As a result, they may be loath to institutionalize CRM or another data-driven application or process because of the perceived IT resources required.
This is a misperception because, as Brenda Somich points out, hosted CRM is growing. The explosion of the cloud and legitimate alternatives to traditionally on-premise applications mean that small businesses can do a great deal more with fewer resources. They need not break the bank to run a true CRM application–and reap the rewards of doing so.
But there's a larger point in this post. You may think that small businesses and large enterprises have little in common.
And you would often be wrong.
Lamentably, many big companies either intentionally ignore or never get around to developing a future state vision for information management. As for those that have a strategy in place, their daily actions often belie this vision.
As your organization grows, stop and ask yourself:
- Are projects, applications, data, and information being managed in completely random ways?
- Are individual actions coordinated?
- Is everyone on the same page? If not, why not?
- And how can the different parts of the organization work in concert to maximize compliance, governance, efficiency, and performance?
In other words, ask yourself if it’s time to implement an information management strategy.
With respect to information management, the goal of any organization of any size is not to develop a strategy. Rather, it should be to realize the benefits of that strategy. Alternatively stated, a strategy in and of itself means nothing. Enforcement and diagnosis of that strategy are critical if it is going to actually work.
What say you?
In my last post, I discussed matching at a relatively high level. I concluded the post with a reference to software that uses some pretty sophisticated math to match based upon probabilities. Netrics was one such company and was recently acquired by Tibco.
In this post, I'd like to discuss what to do when you're under the gun. That is, your organization is in the midst of some type of data cleanup project on a burning platform. Maybe your consultants didn't tell you about the difficulties to expect when migrating data. In any event, you don't have the time to evaluate different matching applications, talk to different vendors, test the application, and the like. You need to solve a problem now, if not sooner.
Note that this post assumes that matching by simple inner joins is not possible. That is, there's no common key among the tables that would allow for easy consolidation and matching. In the simple case shown below, tbl_Student.StudentID would tie directly to tbl_EligibilityStatus.EligibilityStatusID; a record in one table should match a record in the other (the operative word here being should).
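In SQL terms, that simple case is a straightforward inner join. Here is a minimal sketch using the two table names above; the join condition mirrors the relationship just described, and everything else is illustrative:

    -- The easy case: a shared key exists, so an inner join does the matching for us.
    SELECT s.StudentID,
           e.EligibilityStatusID
    FROM   tbl_Student AS s
           INNER JOIN tbl_EligibilityStatus AS e
                   ON e.EligibilityStatusID = s.StudentID;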
So, what's a data management professional to do when employee numbers, product IDs, or customer numbers aren't consistently populated across different systems and tables? While hardly as powerful or accurate as the applications dedicated to solving these problems, a few manual techniques can manipulate data in ways that increase the probability of your matches. In the past, I've used SQL statements and other tricks as the basis for my data matching.
Consider a typical employee table with fields such as first name, last name, SSN, job code, and address.
Now, if social security number (SSN) were consistently defined and populated throughout our tables, systems and applications, matching would be a piece of cake. Lamentably, this is often not the case. As a result, to match, we have to get creative.
I could concatenate different fields as follows, creating what appears to be a unique value for each record (Concat):
- …||100 Main Street
- …||74 Rush Street
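Here is a minimal sketch of that kind of concatenation in SQL. The table and column names (employees, first_name, last_name, ssn, job_code, street_address) are purely illustrative, and || is the standard string-concatenation operator:

    -- Build a pseudo-key by gluing together fields that most systems should share.
    -- All table and column names here are illustrative, not from any real system.
    SELECT employee_id,
           first_name || last_name || ssn || job_code || street_address AS concat_key
    FROM   employees;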
The problem with this type of concatenation, however, is that it assumes that every table, system, and application has all of these fields. This may be a faulty assumption. For instance, what if an applicant tracking system (ATS) does not include an employee's job code? That's certainly possible because the employee hasn't been hired yet!
So, let’s revise our attempt to create a primary key to remove job code:
- …||100 Main Street
- …||74 Rush Street
This seems like a more reasonable approach, as many systems contain at least this basic data on employees. (Whether it’s the correct address, name, SSN, etc. is another matter.) John Smith might be listed as follows:
- John  Smith (note the extra space between the first and last names, something that can be eliminated with TRIM statements)
- Johnathan Smith
- John P. Smith
- Jon Smith
I could go on, but you get my drift. This approach to data matching is by no means guaranteed. It’s at least a starting point and, depending on the data at my disposal, I’ve used it to obtain accuracy levels of approximately 90 percent. So, if I have 1,000 records that need to be matched, at least I’ve given my end users only 100 to look at and individually verify.
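To squeeze a bit more accuracy out of the pseudo-key, I normalize fields before concatenating them: trim whitespace, uppercase everything, and collapse doubled spaces. A rough sketch, again with purely illustrative table and column names, that builds the key in two hypothetical source systems and joins on it:

    -- Normalize before concatenating so 'John  Smith' and 'JOHN SMITH ' yield the same key.
    SELECT h.employee_id,
           a.applicant_id
    FROM   (SELECT employee_id,
                   REPLACE(UPPER(TRIM(full_name)), '  ', ' ') || ssn || street_address AS concat_key
            FROM   hr_system) AS h
           INNER JOIN
           (SELECT applicant_id,
                   REPLACE(UPPER(TRIM(full_name)), '  ', ' ') || ssn || street_address AS concat_key
            FROM   ats_system) AS a
               ON a.concat_key = h.concat_key;

Even so, "Johnathan" and "Jon" will still slip through, which is why the manual verification mentioned above still matters.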
One caution with this method: You may wind up with a few false positives–i.e., records that the above method claims match but in fact do not.
Sometimes software isn't an option and there are far more records than we can match manually. In these instances, we have to improvise. Use some of the techniques described in this post to expedite your matching with a fair degree of accuracy. Check a few of the matches to ensure that they make sense. While not perfect, methods like these can save time and money.
What say you?
Anyone who has ever worked with vast amounts of data has, at one point, had to deal with the issue of matching. In a nutshell, data from one system or system(s) needs to be fused with data from another system or system(s). Only a neophyte believes that this is an easy task with a data set of any decent size. Newbies ask, “Well, if we have 10,000 employees or products in one application, how hard can it be to link them to 10,000 in another?”
And then IT folks cringe.
In this post, I’d like to offer a primer on matching.
The business case for matching is typically straightforward. At some point and at a high level, organizations need to make sense of their data. To be sure, MDM tools exist that allow organizations to maintain different sets of data in different systems (facilitated by a master list). Yes, this can minimize the need for a complete data cleanup endeavor. But MDM does not eliminate the need for organizations to match their data. In fact, MDM may even enhance that need, forcing organizations to match different sets of data to one “master” list. Here, master records would serve as a bridge, crosswalk, or translate (XLAT) table to different data sets.
For instance, John Smith in an applicant tracking system (ATS) needs to be matched with John Smith in an ERP. We assume that both John Smiths are one and the same but, as we'll see in this post, that's not always the case.
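A crosswalk of that sort need not be exotic. Here is a hypothetical sketch of a translate (XLAT) table that maps each source system's identifier to a single master party ID; all names and values are illustrative:

    -- Hypothetical XLAT (crosswalk) table: one master ID per real-world person,
    -- linked to that person's record in each source system.
    CREATE TABLE party_xlat (
        master_party_id  INTEGER     NOT NULL,
        source_system    VARCHAR(20) NOT NULL,  -- e.g., 'ATS' or 'ERP'
        source_record_id VARCHAR(50) NOT NULL,
        PRIMARY KEY (source_system, source_record_id)
    );

    -- Both John Smith records resolve to the same master ID.
    INSERT INTO party_xlat VALUES (1001, 'ATS', 'A-778');
    INSERT INTO party_xlat VALUES (1001, 'ERP', 'E-4432');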
Types of Matching
Delving more into the rationale behind matching, Henrik L. Sørensen recently wrote on his well-trafficked blog that "matching party data – names and addresses – is the most common area where Data Matching is practiced." He goes on to write that the different types of matching include:
- Match with external reference data
- Identity Resolution
To continue with our example, five recruiters at Acme, Inc. might have entered John Smith five separate times in the ATS. This is a big no-no, but I've met many HR folks who don't care much about data quality or the downstream impacts of superfluous transactions. In this case, John Smith needs equal parts deduplication and identity resolution.
Two Approaches: Exact vs Probabilistic Matching
We think that we ultimately need one John Smith throughout Acme's systems, but how do we get there? If John Smith's social security number (123-45-6789) is the same in both systems, then we can feel pretty confident that it's the same person, and we can remove or consolidate extraneous records. This is an exact match.
But what if we're missing some data? What if a number is transposed? Let's say that one record lists his SSN as 123-45-6798. Is this the same John Smith?
Here, exact matching fails us, and we have to look at other means, especially if we're dealing with tens of thousands of records or more. (In the case of one person, we can always ask him!) Instead, we should turn to the second type of matching: probabilistic matching.
The best definition I’ve found on the topic comes from Information Management. It defines probabilistic matching as a method that
uses likelihood ratio theory to assign comparison outcomes to the correct, or more likely decision. This method leverages statistical theory and data analysis and, thus, can establish more accurate links between records with more complex typographical errors and error patterns than deterministic systems.
Translation: the law of large numbers is put to use to ascertain relationships that are very likely to exist. The results can be astounding. I’ve heard of tools that use math to produce results that are more than 99.9% accurate.
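The phrase "likelihood ratio theory" usually points to Fellegi-Sunter-style scoring; the sketch below is the standard textbook formulation rather than anything from the quoted definition. For each compared field, let $m$ be the probability the field agrees when two records truly refer to the same entity, and $u$ the probability it agrees by coincidence. Each field contributes a weight, and the weights are summed:

$$ w_{\text{agree}} = \log_2\frac{m}{u}, \qquad w_{\text{disagree}} = \log_2\frac{1-m}{1-u}, \qquad W = \sum_{\text{fields}} w $$

Pairs whose total weight $W$ clears an upper threshold are linked automatically, pairs below a lower threshold are rejected, and everything in between is routed to a person for review.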
Benefits of Probabilistic Matching
The MIKE2.0 offering shows that the major benefit of probabilistic matching relative to exact matching is that the former employs fuzzy logic to match fields that are similar. As a result, probabilistic matching:
- Allows matching of records that have transposition or spelling errors, thereby obtaining a significant increase in matches over systems using purely string comparisons.
- Standardizes data in free-form fields and across disparate data sources.
- Uncovers information buried in free-form fields and identifies relationships between data values.
- Provides survivorship of the best data within and between source systems.
- Does not require extensive programming to develop matches based on simple business requirements.
Software that matches people on probabilities will link John Smith (123-45-6798) with John Smith (123-45-6789). It’s the same guy.
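Short of buying such a tool, you can approximate a thin slice of this idea in plain SQL by generating candidate pairs that agree on some fields but disagree on others and then routing them to a human. This is triage rather than true probabilistic matching, and every name below is illustrative:

    -- Candidate pairs: same normalized name, different SSN.
    -- A reviewer decides whether 123-45-6789 and 123-45-6798 are the same John Smith.
    SELECT a.applicant_id,
           e.employee_id,
           a.ssn AS ats_ssn,
           e.ssn AS erp_ssn
    FROM   ats_people a
           INNER JOIN erp_people e
                   ON  UPPER(TRIM(a.full_name)) = UPPER(TRIM(e.full_name))
                   AND a.ssn <> e.ssn;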
Of course, the best way to match your data is to manage it properly from the get-go. Even basic ATSs allow organizations to establish business rules that would minimize the likelihood of our two John Smiths. (Prohibiting applicants with the same SSN comes to mind.)
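As a concrete sketch of such a rule, a simple uniqueness constraint rejects the second John Smith at the point of entry (assuming a hypothetical applicants table with an ssn column):

    -- Refuse to accept a second applicant record carrying an SSN already on file.
    ALTER TABLE applicants
      ADD CONSTRAINT uq_applicants_ssn UNIQUE (ssn);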
But minimize and eliminate are two different things. End users indifferent to data quality beat systems rife with business rules any day of the week–and twice on Sunday. A culture of data governance is the end-game, but you might need to employ some matching techniques to triage some data issues before you get to the Holy Grail.
What say you?
Yogi Berra famously once said, “When you come to a fork in the road, take it.” In this post, I’ll discuss a few different paths related to data quality during data migrations.
But let’s take a step back first. Extract-Transform-Load (ETL) represents a key process in any information management framework. At some point, at least some of an organization’s data will need to be taken from one system, data warehouse, or area, transformed or converted, and loaded into another data area.
The entire MIKE2.0 framework in large part hinges on data quality. DQ represents one of the methodology’s key offerings, if not its most important one. To this end, it’s hardly unique as an information management framework. But, as Gordon Hamilton and my friend Jim Harris pointed out recently on an episode of OCDQ Radio, not everyone is on the same page when it comes to when we should actually clean our data. Hamilton talks about EQTL (Extract-Quality-Transform-Load), a process in which data quality is improved in conjunction with an application’s business rules. (Note that there are those who believe that data should only be cleaned after it has been loaded into its ultimate destination–i.e., that ETL should give way to ELT.)
Why does this matter? Many reasons, but I’d argue that implicit in any decent information management framework is the notion of data quality. While many frameworks, models, and methodologies vary, it’s hard to imagine any one worth its salt that pooh-poohs accurate information. (For more on a different framework, see Robert Hillard’s recent post.)
Data Quality Questions
Different frameworks aside, consider the following questions:
- As data management professionals, should we be concerned about the quality of the data as we are transforming and loading it?
- Or can our concerns be suspended during this critical process?
- And the big question: Is DQ always important?
I would argue yes, although others may disagree with me. So, for the organization migrating data, when is the best time to conduct the cleansing process? Here are the usual suspects:
- Pre-Migration
- During Migration
- Post-Migration
Organizations can often save an enormous amount of time and money if they cleanse their data before moving it from point A to point B. To be sure, different applications call for different business rules, fields, and values. Regardless, it's a safe bet that a new system will require different information than its predecessor. Why not get it right before mucking around in the new application?
Some contend that this is generally the ideal point to cleanse data. However, retiring a legacy system typically means that organizations have decided to move on. (The term legacy system implies that it is no longer able to meet a business's needs.) As such, why spend the time, money, and resources "fixing" the old data? In fact, I've seen many organizations specifically refuse to cleanse legacy data because their end users felt more secure knowing that they could retrieve the old data if need be. So much for cutting the cord.
If your organization takes this approach, then it is left with two options: cleanse the data during or post-migration. If given the choice, I’d almost always opt for the former. It’s just easier to manipulate data in Excel, Access, ACL, or another tool than it is in a new system loaded with business rules.
The ease of manipulating data outside of a system is the very reason that many organizations prefer to clean their data in a system of record. Excel doesn't provide a sufficient audit trail for many folks concerned about lost, compromised, or duplicated data. As such, it makes more sense to clean data in the new system. Many applications support this by allowing for text-based fields that let users indicate the source of the change. For instance, you might see a conversion program with a field populated with "data change – conversion."
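As a hedged sketch of what that can look like in practice, a cleanup run inside the new system might correct the value and stamp its provenance in the same statement; the employees table, the change_source column, and the values are all hypothetical:

    -- Correct the bad value and record why it changed, so the audit trail
    -- shows that the fix came from the conversion effort.
    UPDATE employees
    SET    street_address = '100 Main Street',
           change_source  = 'data change - conversion'
    WHERE  employee_id = 1001;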
Is this more time-consuming? Yes, but it provides greater audit capability.
What say you?
More and more organizations are creating formal positions for the Chief Data Officer (CDO)–aka, the Data Czar. In this post, I’ll discuss the role and some of the challenges associated with it.
Background: The What and the How
The notion of a CDO is not exactly new. In fact, credit card and financial services company Capital One hired one back in 2003. (Disclaimer: I used to work for the company).
At a high level, the CDO is supposed to play a key executive leadership role in driving data strategy, architecture, and governance. S/he is the executive leader for data management activities. So, we know what the CDO is supposed to do, but how exactly is s/he supposed to accomplish these often delicate tasks? There's no simple answer.
In a recent post for the Data Roundtable, Tom Redman writes:
…there is not yet a script for implementing the CDO. Organizations need to experiment to figure out how it will work best. I think this was and is the most important observation. And, very likely different organizations will adopt different directions based on the roles data play, to satisfy customers, conduct operations, make decisions, innovate, and so forth. One would expect different strategies to lead to different organizational forms.
Redman is right to point out the myriad waters that a CDO has to navigate. The CDO's is by no means a simple job, and it touches many parts of the organization.
Who and Why?
Rooted in this question of “What does the CDO do?” are two related queries:
- To whom does the CDO report?
- Why is this person here?
In fact, some people openly question the basic need for a CDO. By way of contrast, an organization's CFO or Chief People Officer (CPO) occupies a more established role, even if the latter's title has changed. Does anyone really doubt the need for a head of HR or finance?
Few organizations place the CDO on par with the CTO or CIO. It's not uncommon for the CDO to report to one of the other "C's." This raises the question: should one "C" report to another? Is this just title inflation?
I can see the need for an effective CDO at a large, multi-national organization lacking discipline with respect to its data management practices. You could certainly make the case for a Data Czar when an organization’s culture is miles away from meaningful data governance.
But it’s hard for me to imagine the CDO as a must-have role for every type or organization. Even a mid-market company might be hard-pressed to justify the expense. A well-run conglomerate can probably get by without one.
To be sure, the CDO has the ability (responsibility?) to look at things from a much more global perspective than busy line of business (LOB) users stuck in the trenches. And, at least to me, this is part of the fundamental challenge for the CDO: How does s/he manage the IT-business chasm? End users create, modify, and purge data; data shouldn't be the sole responsibility of "the Data Department" or even IT.
To be effective, CDOs certainly have their work cut out for them. They probably only exist in companies facing very complicated and significant data challenges. As such, they must strike the appropriate balance between:
- solving immediate data-related problems; and
- understanding high-level data and information management issues
By the same token, though, CDOs cannot be seen as the police if they want to be effective. People tend to dread calls and meetings with those who criticize their practices without understanding the business issues at play.
What say you?
I know a guy who works for a large retail organization (call it ABC here) in an information management capacity. Let’s call him James in this post. Over the last 12 years, James has ascended to a position of relative prominence at ABC. (He’s one of maybe ten assistant vice presidents.) He manages five people directly and is responsible for a number of contractors overseas. He can still build cubes, roll up his sleeves to solve vexing data-oriented issues, and talk the talk.
Beyond technical skills, James plays by his company’s rules and never rocks the boat. He listens very intently when his internal clients talk to him about their needs, frustrations, and suggestions. James is incredibly diplomatic and has rarely offended anyone, even during organizational crises. The word diligent is entirely appropriate to describe him. He’s a real asset to his company but he has to make a pretty big adjustment if he wants to make it to the next level.
Take a guess. I’ll wait.
James is too neutral.
The Problem with Neutrality
In any large organization, one is unlikely to be successful–or remain employed, for that matter–by being a complete maverick. (Small companies are often different, but let’s focus on larger ones in this post.) Big company management is rooted in a military model in which the following are valued:
- Following orders
- Playing by the rules
- Not questioning authority
From an information management perspective, this means writing reports requested by users, fixing data issues, gathering requirements, and the like. Note that none of these activities remotely resembles leadership.
There's obviously a larger point here about workplace rights and behavior. I can't speak about other countries, but at least in the United States, the Constitution's protections largely do not extend into the workplace–subject to a few exceptions surrounding discrimination, whistle-blowing, and the like. For instance, the government grants you the right to say just about whatever you want, but your manager can fire you for doing so. Brass tacks: Speak your mind if you like, just be ready to pay the price.
But rank has always had its privileges. The standard is not the same across the board, and not everyone is equal. An entry-level AP clerk finds himself on a tighter leash than a CXO. Mid-level managers aren't expected to set organizational directions and strategies. As you move up the corporate ladder, it is very difficult–if not impossible–to remain neutral if you want to keep ascending. You're going to have to make judgment calls, especially when the data tell you different–and even conflicting–things.
For instance, let's say that ABC is conflicted about how to build a database. Some people are reluctant to depart from traditional methods; they want to stay the course with row-based databases. Then there are others willing to embrace the unknown–i.e., move to a columnar database.
Consider the following:
- Is this a big decision? Yes.
- Is this bell hard to unring? Absolutely.
- Should James carefully consider each viewpoint? Of course.
- Should James ultimately have an opinion and a recommendation?
How can he not?
That's not to say that everyone ought to disagree with everyone else just to flex muscle. Nor do I claim that the ability to reach compromise should be minimized. On the contrary, it's incredibly valuable to see both sides of an issue and broker a truce.
But there’s only so much you can do when you’re neutral.
What say you?
If you're in the technology or information management spaces, you've probably heard the axiom "Information wants to be free." While accounts vary, many attribute the quote to Stewart Brand, who uttered it at the first Hackers Conference in 1984.
Today, information is becoming increasingly prevalent and less expensive, spawning concepts such as Open Data, defined as:
the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. While not identical, open data has a similar ethos to those of other “Open” movements such as open source, open content, and open access. The philosophy behind open data has been long established (for example in the Mertonian tradition of science), but the term “open data” itself is recent, gaining popularity with the rise of the Internet and World Wide Web and, especially, with the launch of open-data government initiatives such as Data.gov.
Again, you may already know this. But consider what companies such as Facebook are doing with their infrastructure and server specs–the very definition of traditionally closed apparatus. The company has launched the Open Compute Project, which turns the proprietary data storage models of Amazon and Google on their heads. From the site:
We started a project at Facebook a little over a year ago with a pretty big goal: to build one of the most efficient computing infrastructures at the lowest possible cost.
We decided to honor our hacker roots and challenge convention by custom designing and building our software, servers and data centers from the ground up.
The result is a data center full of vanity free servers which is 38% more efficient and 24% less expensive to build and run than other state-of-the-art data centers.
But we didn’t want to keep it all for ourselves. Instead, we decided to collaborate with the entire industry and create the Open Compute Project, to share these technologies as they evolve.
In a word, wow.
Both open data and the Open Compute Project reveal complex dynamics at play. On one hand, there's no doubt that more, better, and quicker innovation results from open source endeavors. Crowdsourced projects benefit greatly from vibrant developer communities. In turn, these volunteer armies create fascinating extensions, complementary projects, and new directions for existing applications and services.
I could cite many examples, but perhaps the most interesting is WordPress. Its community is nothing less than amazing–and the number of themes, extensions, and plug-ins grows daily, if not hourly. The vast majority of its useful tools are free or nearly free. And development begets more development, creating a network effect. Millions of small businesses and solopreneurs make their livings via WordPress in one form or another.
On the other hand, there is such a thing as too open–and WordPress may be an equally apropos example here. Because the software is available to all of the world, it’s easier for hackers to launch Trojan horses, worms, DoS attacks, and malware aimed at popular WordPress sites. To be sure, determined hackers can bring down just about any site (WordPress or not), but when they have the keys to the castle, it’s not exactly hard for them to wreak havoc.
Does Facebook potentially gain by publishing the design of its data centers and servers for all to see? Of course. But the risks are substantial. I can't tell you whether those risks are worth the rewards; I just don't have access to all of the data.
But I certainly wouldn't feel comfortable doing as much if I ran IT for a healthcare organization or an e-commerce company. Imagine a cauldron of hackers licking their lips at the prospect of stealing a trove of highly personal and valuable information, surely to be used for unsavory purposes.
When considering how open to be, look at the finances of your organization. Non-profits and startups might find that the juice of erring on the side of openness is worth the squeeze. For established companies with very sensitive data and a great deal to lose, however, it's probably not wise to be too open.
What say you?