Posts Tagged ‘Hadoop’
Big Data requires million-dollar investments.
Nonsense. That notion is just plain wrong. Long gone are the days in which organizations need to purchase expensive hardware and software, hire consultants, and then three years later start to use it. Sure, you can still go on-premise, but for many companies cloud computing, open source tools like Hadoop, and SaaS have changed the game.
But let’s drill down a bit. How can an organization get going with Big Data quickly and inexpensively? The short answer is, of course, that it depends. But here are three trends and technologies driving the diverse state of Big Data adoption.
Crowdsourcing and Gamification
Consider Kaggle. Founded in April 2010 by Anthony Goldbloom and Jeremy Howard, the company seeks to make data science a sport, and an affordable one at that. Kaggle is equal parts crowdsourcing company, social network, wiki, gamification site, and job board (like Monster or Dice).
Kaggle is a mesmerizing amalgam of a company, one that in many ways defies business convention. Anyone can post a data project by selecting an industry, type (public or private), type of participation (team or individual), reward amount, and timetable. Kaggle lets you easily put data scientists to work for you, and renting them is much less expensive than buying them.
Open Source Applications
But that’s just one way to do Big Data in a relatively inexpensive manner–at least compared to building everything from scratch and hiring a slew of data scientists. As I wrote in Too Big to Ignore, digital advertising company Quantcast attacked Big Data in a very different way, forking the Hadoop file system. This required a much larger financial commitment than just running a contest on Kaggle.
The common thread: Quantcast’s valuation is nowhere near that of Facebook, Twitter, et al. The company employs dozens of people–not thousands.
Finally, even large organizations with billion-dollar budgets can save a great deal of money on the Big Data front. Consider NASA, nowhere close to anyone’s definition of small. NASA embraces open innovation, running contests on Innocentive to find low-cost solutions to thorny data issues. NASA often offers prizes in the thousands of dollars, receiving suggestions and solutions from all over the globe.
I’ve said this many times. There’s no one “right” way to do Big Data. Budgets, current employee skills, timeframes, privacy and regulatory concerns, and other factors should drive an organization’s direction and choice of technologies.
What say you?
In Too Big to Ignore, I wrote about the increasing importance of technologies and systems designed to handle non-relational data. Yes, the structured information on employees, sales, customers, inventory, and the like still matter. But the story doesn’t end with Small Data. There’s a great deal of value to be gleaned from the petabytes of unstructured data lying outside of organizations’ walls. Hadoop is just one tool that can help realize that value.
But no one ever said that Hadoop was perfect or even ideal. The first major iteration of any important technology or application never is.
To that end, data geeks like me could hardly contain our excitement at the announcement that Hadoop 2.0 is now generally available.
The biggest change in Apache Hadoop 2.2.0, the first generally available version of the 2.x series, is the update of the MapReduce framework to Apache YARN, also known as MapReduce 2.0. MapReduce is a big feature in Hadoop—the batch processor that lines up jobs that go into the Hadoop Distributed File System (HDFS) to pull out useful information. In the previous version of MapReduce, jobs could only be done one at a time, in batches, because that’s how the Java-based MapReduce tool worked.
With the available update, MapReduce 2.0 will enable multiple search tools to hit the data within the HDFS storage system at the same time.
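To make the batch model concrete, here’s a toy word-count sketch in plain Python (an illustration of the map/shuffle/reduce pattern, not actual Hadoop code): mappers emit key/value pairs, a shuffle step groups them by key, and reducers aggregate the groups.

```python
from collections import defaultdict

# Toy illustration of the MapReduce batch model (not real Hadoop code).

def map_phase(documents):
    # Each "mapper" emits a (word, 1) pair for every word it sees.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # The shuffle groups all values by key before the reduce step runs.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Each "reducer" aggregates the values for one key.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["Big Data is big", "Hadoop handles big data"]
result = reduce_phase(shuffle(map_phase(docs)))
print(result["big"])  # "big" appears 3 times across the two documents
```

The key limitation the post describes is visible even here: nothing downstream can start until the whole reduce phase finishes, which is exactly the one-job-at-a-time constraint YARN relaxes.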
Hadoop and Platforms
I asked my friend Scott Kahler about Hadoop 2.0 and he was nothing short of effusive. “Yes, it’s a huge deal. YARN will make Hadoop a distributed app platform and not just a Big Data processing engine,” Kahler told me. “YARN is enabling things like graph databases (Giraph) and event processing engines (Storm) to get instantiated much easier on common distributed system infrastructure.”
I know a thing or two about platforms, and Hadoop 2.0 underscores the fact that it is becoming a de facto ecosystem for Big Data developers across the globe. Got an idea for a new app or web service? Build it on top of Hadoop. Take the core product in a different direction. If others find that app or web service useful, expect further development on top of your work.
Simon Says: We’re Just Getting Started
Hadoop naysayers abound. For all I know, Hadoop isn’t the single best way of handling Big Data. Still, it’s hard to argue that the increased functionality of its second major iteration isn’t a big deal. As it continues to evolve and improve, the benefits begin to exceed its costs.
Yes, many if not most organizations will still resist Big Data for all sorts of reasons. An increasingly developer-friendly Hadoop, though, means great things for enterprises willing to jump into the Big Data fray.
What say you?
“I saw the angel in the marble and carved until I set him free.”
The era of Big Data has arrived, yet relatively few organizations seem to recognize it. Platitudes from CXOs are all fine and dandy, but how many have invested in Hadoop or hired a data scientist? Not too many, in my view. (See “Much Hadoop About Nothing.”)
Brass tacks: The hype around Big Data today is much greater than the reality–and it probably will be for some time.
This is unfortunate, as many organizations already have within their walls very valuable data that could be turned into information and knowledge with the right tools. Because of their unwillingness to adopt more contemporary Big Data and dataviz applications, though, that knowledge effectively hides in plain sight. The ROI question still paralyzes many CXOs afraid to jump into the abyss.
I know something about the notion of hiding in plain sight. It is one of the major themes of my favorite TV show, Breaking Bad.
To some extent, I understand the reluctance surrounding Hadoop. After all, it’s a fundamentally different way of thinking about data, modeling, and schema. Most IT professionals are used to thinking about data in orderly and relational terms, with tables and JOIN statements. Those with the skills to work with this type of data are in short supply, at least for the time being. “Growing” data scientists and a new breed of IT professionals doesn’t happen overnight. The same is true of lawyers and doctors.
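For readers outside IT, that orderly relational mindset looks something like the following minimal sketch using Python’s built-in sqlite3 module (the table and column names are invented for illustration):

```python
import sqlite3

# The "orderly and relational" worldview: normalized tables linked by
# primary/foreign keys, queried with JOINs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER)")
conn.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, dept_name TEXT)")
conn.execute("INSERT INTO departments VALUES (1, 'Sales'), (2, 'IT')")
conn.execute("INSERT INTO employees VALUES (10, 'Alice', 1), (11, 'Bob', 2)")

# A JOIN stitches the two tables back together via the foreign key.
rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e
    JOIN departments d ON e.dept_id = d.id
    ORDER BY e.name
""").fetchall()
print(rows)  # [('Alice', 'Sales'), ('Bob', 'IT')]
```

Tweets, log files, and sensor streams rarely decompose this cleanly into keyed tables, and that mismatch is a big part of why Hadoop feels so foreign at first.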
Overcoming stasis isn’t easy, especially in budget-conscious, risk-averse organizations. To that end, here are a few tips on getting started with Big Data:
- Don’t try to boil the ocean. Small wins can be huge, to paraphrase from the excellent book The Power of Habit: Why We Do What We Do in Life and Business.
- Communicate successes. Getting people to come to you is much easier than forcing them. The carrot is more effective than the stick.
- Under-promise and over-deliver.
Compared to a year ago, I have seen progress with respect to Big Data adoption. Increasingly, intelligent people and companies are doing more with new forms of data—and getting more out of it. As a result, data visualization has become a big deal. To paraphrase Michelangelo, they are starting to set the data free.
What say you?
In his recent post, my friend Jim Harris contrasts data democracies with data dictatorships. Harris writes that most organizations are data democracies, “which means that other data sources, both internal and external, will be used.” Truer words have never been written.
Of course, this hasn’t always been the case. Two decades ago, data management wasn’t nearly as democratic and chaotic as it is today. Many organizations could control a good portion of the data under their umbrellas–or at least try.
Enter the Internet.
Today, plenty of valuable data still comes from employees and customers. But they’re hardly alone. Data also emanates from partners, users, machines, websites, social networks, feeds, and many other sources as well. Enterprises today need to “manage” and aggregate data from myriad places. What’s more, no longer do organizations have to concern themselves with only structured data streaming at them once in a while. Today, they have to deal with multiple data types (read: structured and semi-structured), and at an increasing velocity to boot.
Yes, the tools today are much more powerful and dynamic than even ten years ago. Hadoop represents the de facto Big Data standard, but plenty of NoSQL, NewSQL, and other Big Data solutions exist.
The point is that organizations would do well not to think of data management as either-or/two poles: democracy vs. dictatorship. Rather, more than ever, these two extremes are part of the same continuum. (See below.)
I cannot think of an enterprise with completely democratic or dictatorial data management. Today, the most intelligent organizations incorporate both democratic and dictatorial elements into their information management.
Dictatorship good: Letting employees set their own salaries isn’t terribly wise.
Democracy good: Many companies let employees handle their own benefits via open enrollment. This is very smart. In the social sphere, only paying attention to company-initiated tweets or blog posts is ill advised. Companies that ignore consumer complaints on social networks do so at their own peril. Just ask United Airlines.
I haven’t been to too many meetings in which senior folks have openly asked, “How democratic or dictatorial should we be?” Not asking the question every so often, however, almost guarantees organizations ignore potentially critical information. Democratic data like user-generated photos, blog posts, comments, videos, and podcasts is exploding.
Just because an organization cannot control that data doesn’t mean that it should disregard it. Embrace a hybrid strategy. Democracy and dictatorships each have their place.
What say you?
Think for a minute about how much we spend on healthcare. In the United States, the numbers break down as follows:
- Roughly $3 trillion spent annually, a number rising at 6-7% per year
- This represents about 17% of US Gross Domestic Product (GDP)
- Some estimates put the number wasted annually on healthcare at a mind-boggling $2 trillion
- There’s at least $60B in Medicare fraud alone each year (and some estimates put that number at $250B)
For more astonishing data on healthcare, click here. The stats are frightening. With so much waste and opportunity, it should be no surprise that quite a few software vendors are focusing on Big Data–and not just behemoths like IBM. Start-ups like Explorys, Humedica, Apixio, and scores of others have entered the space.
Where’s the Data?
With so much action surrounding Big Data and healthcare, you’d think that there would be a tremendous number of examples. You’d expect there to be more statistics on how Big Data has helped organizations save lives, reduce costs, and increase revenue.
And you’d be wrong.
I’ve worked in hospitals a great deal over my career, and the term risk aversion is entirely apropos. Forget for a minute the significant difficulty in isolating cause and effect. (It’s not easy to accurately claim that deploying Hadoop throughout the organization saved 187 lives in 2012.)
Say for a minute that you’re the CIO of a healthcare organization and you make such a claim. Think about the potential ramifications from lawsuit-happy attorneys. Imagine having to respond to inquiries from lawyers about why you waited so long to deploy software that would have saved so many lives. What were you waiting for? How much will you pay my clients to drop their suit?
This isn’t to say that you can’t find data on, well, Big Data and healthcare. You can. You just have to look really hard–and you’ll more than likely be less than satisfied with the results. For example, this Humedica case study shows increased diagnosis of patients with diabetes who fell between the cracks.
Large organizations are conservative by nature. Toss in potential lawsuits and it’s easy to understand the paucity of results-oriented Big Data healthcare studies. What’s more, we’re still in the early innings. Expect more data on Big Data in healthcare over the coming years.
What say you?
Half the money I spend on advertising is wasted; the trouble is I don’t know which half.
Executive turnover has always fascinated me, especially as of late. HP’s CEO Leo Apotheker had a very short run and Yahoo! has been a veritable merry-go-round over the last five years. Beyond the CEO level, though, many executive tenures resemble those of Spinal Tap drummers. For instance, CMOs have notoriously short lifespans. While the average tenure of a CMO has increased from 23.6 to 43 months since 2004, it’s still not really a long-term position. And I wonder if Big Data can change that.
In a recent article for Chief Marketer, Wilson Raj, the global customer intelligence director at SAS, writes about the potential impact of Big Data on CMOs. From the piece:
CMOs today are better poised than ever not only to retain their roles, but to deliver broad, sustainable business impact. CMOs who capitalize on big data will reap big rewards, both personally and professionally. Bottom line: Businesses that exploit big data outperform their competition.
Necessary vs. Sufficient Conditions
The potential of Big Data is massive. To realize it to an optimal level, however, organizations need to effectively integrate transactional and analytical data and systems. Lamentably, many organizations are nowhere close to being able to do this. That is, for every Quantcast, Amazon, Target, and Wal-Mart, I suspect that dozens or even hundreds of organizations continue to struggle with what should be fairly standard blocking and tackling. Data silos continue to plague many if not most mature organizations.
Utilizing Big Data in any meaningful way involves a great deal more than merely understanding its importance. Big Data requires deploying new solutions like Hadoop and NoSQL databases such as Cassandra. Only then will CMOs be able to determine the true ROI of their marketing efforts. That is, accessing and analyzing enterprise and external (read: social) information guarantees nothing. A CMO will not necessarily be able to move the needle just because s/he has superior data. (Microsoft may have all of the data in the world, but so what? Bing hasn’t made too many inroads in the search business and Surface isn’t displacing the iPad anytime soon.)
Think of access to information as a necessary but insufficient condition to ensure success. As I look five and ten years out, I see fewer and fewer CMOs being able to survive on hunches and standard campaigns. The world is just moving too fast and what worked six months ago may very well not work today.
Some believe that Big Data represents the future of marketing. I for one believe that Big Data and related analytics can equip organizations with extremely valuable and previously unavailable information. And, with that information, they will make better decisions. Finally, marketers will be able to see what’s actually going on with their campaigns. Perhaps problems like the one mentioned at the beginning of this post can finally be solved.
What say you?
If you haven’t heard of data science, you will soon. As organizations realize that Big Data isn’t going away, they will finally come around. This is always the case with the technology adoption lifecycle. Yes, this may very well mean new hardware purchases and upgrades, as well as new software solutions like Hadoop and NoSQL. At some point, however, employees with new skills will have to make all of this new stuff sing and dance.
Enter the Data Scientist
Part statistician, part coder, part data modeler, and part businessperson, the data scientist has grown in importance since the term was introduced in 2008. But I’d argue that the single most important attribute of the data scientist is a childlike curiosity about why things happen–or don’t.
In their 2012 HBR piece “Data Scientist: The Sexiest Job of the 21st Century,” Thomas H. Davenport and D.J. Patil write about the key characteristics of these folks. From the piece:
But we would say the dominant trait among data scientists is an intense curiosity—a desire to go beneath the surface of a problem, find the questions at its heart, and distill them into a very clear set of hypotheses that can be tested. This often entails the associative thinking that characterizes the most creative scientists in any field. For example, we know of a data scientist studying a fraud problem who realized that it was analogous to a type of DNA sequencing problem. By bringing together those disparate worlds, he and his team were able to craft a solution that dramatically reduced fraud losses.
Think about what a real scientist does for a moment. Whether trying to invent a drug or cure a disease, scientists look at data, form hypotheses, test them, more often than not fail, reevaluate, and refine. Many problems remain unsolved even after years of analysis. Louis Pasteur didn’t create a vaccine for rabies and anthrax over a weekend. Science is not a linear process.
I’d argue that the same thing applies to data science. Detecting patterns in petabyte-scale datasets is much different from writing a simple SELECT statement or doing ETL. Big Data means plenty of iterations and failures. Understanding why sales are slipping, or customer behavior in general, is not a simple endeavor–nor a static one. That is, factors motivating people to buy products and services will probably change over time. Paying a data scientist a king’s ransom may come with the expectation of immediate, profound insights. That may well be the case, but it’s also entirely plausible that progress will take a great deal of time, especially at the beginning.
As Jill Dyche points out on HBR, there’s rarely a eureka moment. If your organizational culture does not permit failure and insists upon immediate results, maybe hiring a data scientist isn’t wise. Your organization should save some money and revisit Big Data and data science in five years. That is, if your organization is still around.
What say you?
Nate Silver is a really smart guy. Not even 40, he’s already developed tools to analyze statistics ultimately bought by Major League Baseball. He writes about politics and stats for the New York Times. And, at the Data Science Summit here in Las Vegas, I saw him speak about Big Data. (You can watch the video of that talk here.)
DSS was all about Big Data, a topic with which I’m not terribly familiar. (It is a fairly new term, after all.) I suspected that Silver would be speaking over my head. But a funny thing happened when I watched him speak: I was able to follow most of what he was saying.
By way of background, I’m a numbers guy and have been for a very long time. Whether in sports or work, I like statistics. I like data. I remember most of the material from my different probability and stat courses in college. I understand terms like p values, statistical significance, and confidence intervals. This isn’t that difficult to understand; it’s not like I’m a full-time poet or ballet dancer.
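For the curious, here’s what one of those terms looks like in practice: a quick Python sketch of a 95% confidence interval for a sample mean, using the normal approximation. The numbers are invented; the mechanics are the point.

```python
import math

# A 95% confidence interval for a sample mean (normal approximation).
# The sample values below are made up for illustration.
sample = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]
n = len(sample)
mean = sum(sample) / n
# Sample variance uses n - 1 in the denominator (Bessel's correction).
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
std_err = math.sqrt(variance / n)
z = 1.96  # z-score for 95% confidence
lower, upper = mean - z * std_err, mean + z * std_err
print(f"mean={mean:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

The interval narrows as the sample grows, which is one reason Big Data can sharpen estimates that Small Data leaves fuzzy.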
Wide, Not Long
The interesting thing about new data sources and streams is that some of the old tools just don’t cut it. Consider the relational database. Data from CRM or ERP applications fits nicely into transactional tables linked by primary and foreign keys. On many levels, however, social media is far from CRM and ERP–and this has profound ramifications for data management. Case in point: Twitter runs on Scala, not a relational database. Why? To make a long story short, the type of data generated and accessed by Twitter just can’t run fast enough in a traditional RDBMS architecture. This type of data is wide, not long. Think columns, not rows.
To this end, Big Data requires some different tools and Scala is just one of many. Throw Hadoop in there as well. But our need to use “some different tools” does not mean that tried and true statistical principles fall by the wayside. On the contrary, old stalwarts like Bayes’ Theorem are still as relevant as ever, as Silver pointed out during his speech.
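To see why Bayes’ Theorem still matters, consider a quick Python sketch applying it to a hypothetical fraud detector. All of the rates below are invented for illustration.

```python
# Bayes' Theorem: updating a prior belief with evidence.
# Hypothetical setup: 1% of transactions are fraudulent; the detector
# flags 90% of fraud but also false-alarms on 5% of legitimate ones.
p_fraud = 0.01             # prior P(fraud)
p_flag_given_fraud = 0.90  # sensitivity P(flag | fraud)
p_flag_given_ok = 0.05     # false-positive rate P(flag | not fraud)

# Total probability of a flag, across both fraudulent and legit transactions.
p_flag = p_flag_given_fraud * p_fraud + p_flag_given_ok * (1 - p_fraud)

# Bayes' Theorem: P(fraud | flag) = P(flag | fraud) * P(fraud) / P(flag)
p_fraud_given_flag = p_flag_given_fraud * p_fraud / p_flag
print(f"P(fraud | flagged) = {p_fraud_given_flag:.3f}")  # ≈ 0.154
```

Even with a 90%-sensitive detector, a flagged transaction is actually fraudulent only about 15% of the time. That base-rate effect is precisely the kind of old statistical truth that no amount of new tooling makes obsolete.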
Simon Says: The Best of Both Worlds
In an era of Big Data, there will be winners and losers. Organizations that see success will be the ones that combine the best of the old with the best of the new. Retrofitting unstructured or semistructured data into RDBMSs won’t cut it, nor will ignoring still-relevant statistical techniques. Use the best of the old and the best of the new to succeed in the Big Data world.
What say you?
A few years ago and while its stock was still sky-high, Netflix ran an innovative contest with the intent of improving its movie recommendation algorithm. Ultimately, a small team figured out a way for the company to significantly increase the accuracy with which it gently suggests movies to its customers.
It turns out that these types of data analysis and improvement contests are starting to catch on. Indeed, with the rise of Big Data, cloud computing, open source software, and collaborative commerce, it has never been easier to outsource these “data science projects.”
From a recent BusinessWeek article:
In April 2010, Anthony Goldbloom, an Australian economist, [f]ounded a company called Kaggle to help businesses of any size run Netflix-style competitions. The customer supplies a data set, tells Kaggle the question it wants answered, and decides how much prize money it’s willing to put up. Kaggle shapes these inputs into a contest for the data-crunching hordes. To date, about 25,000 people—including thousands of PhDs—have flocked to Kaggle to compete in dozens of contests backed by Ford (F), Deloitte, Microsoft (MSFT), and other companies. The interest convinced investors, including PayPal co-founder Max Levchin, Google Chief Economist Hal Varian, and Web 2.0 kingpin Yuri Milner, to put $11 million into the company in November.
The potential for these types of projects is hard to overstate. Ditto the benefits.
Think about it. Organizations can publish even extremely large data sets online for the world at large. Interested groups, companies, and even individuals can use powerful tools such as Hadoop to analyze the information and provide recommendations. In the process, these insights can lead to developing new products and services and dramatic enhancements in existing businesses process (see Netflix).
Of course, these organizations will have to offer some type of prize or incentive. Building a better mousetrap may be exciting, but don’t expect too many people to volunteer their time without the expectation of significant reward. Remember that, of the millions of people who visit Wikipedia every day, only a very small percentage of them actually does any editing. If Wikipedia (a non-profit) offered actual remuneration, that number would be significantly higher (although the quality of its edits would probably suffer).
Consider the following examples:
- A pharmaceutical company has a raft of data on a new and potentially promising drug.
- A manufacturing company has years of historical data on its defects.
- A retailer is trying to understand its customer churn but can’t seem to get its arms around its data.
I could go on, but you get my drift.
While there will always be the need for proprietary data and attendant analysis, we may be entering an era of data democratization. Open Data is here to stay and I can certainly see the growth of marketplaces and companies like Kaggle that match data analysis firms with companies in need of that very type of expertise.
Of course, this need has always existed, but unprecedented power of contemporary tools, technologies, methodologies, and data mean that outsourced analysis and contests have never been easier. No longer do you have to look down the hall, call IT, or call in a Big Four consulting firm to understand your data–and learn from it.
What say you?
You may laugh at this statistic.
As of the end of September, AOL still had 3.5 million subscribers to its dialup Internet access service. A recent BusinessInsider article points out that:
According to AOL’s earnings release, the “average paid tenure” of its subscribers was about 10.6 years in Q3, up from about 9.4 years last year. (Of course, some of AOL’s existing access subscribers might not even realize they’re still paying for it.)
But for those who don’t have access to broadband, don’t want it, or don’t need it, AOL is still better than no Internet access.
I read this article and, once I got past the fact that more than 3 million people still hear those annoying dial-up sounds, couldn’t help but wonder about the inherent data limits of using such a dated technology. Note that I am well aware that for years AOL has offered broadband services. Yet, the company’s dial-up customers represent the very definition of Small Data.
Now, 3.5 million people can generate an awful lot of data, but dial-up is not making a comeback anytime soon. It’s hard for me to envision a future in which 52 kbit/s Internet connection speeds rule the world. (Can you imagine a 15-year-old kid waiting two minutes for a page to load?) And AOL head Tim Armstrong agrees, as evinced by his hyperlocal news strategy. Armstrong believes that the future of the Internet is all about content. (Shhh…don’t tell anyone.)
A Tale of Two Data Approaches
When you’re connected to the Internet at a fraction of the speed of the rest of the world, you’re going to generate a fraction as much data as those flying around at 20 Mbit/s. Pages take longer to render, transactions don’t process as quickly, and people become frustrated and give up doing whatever they were doing. And you’ll lose customers–and all of the data that goes with them. The best that you can hope for is (relatively) Small Data.
Contrast Small Data with Big Data. The latter is so, well, big that it cannot be handled with traditional data management tools. Hence the need for tools like Scala and Hadoop. Hadoop is a project with the intent of developing open-source software for reliable, scalable, distributed computing. Now, Big Data can be a big challenge, but it’s kind of like having to pay “too much” in taxes because you make a great deal of money. Would you rather not have to pay any taxes because you are impoverished?
AOL’s problems are well beyond the scope of an individual post, but suffice it to say for now that any organization that embraces Small Data faces considerable limits on its growth, revenue, and ultimately profits. The larger point is that, if you’re a small company with a fairly conservative reach and ambition, then Small Data may well be the way to go.
However, if you’re a burgeoning enterprise that plans on growing in all sorts of unexpected directions, get away from the Small Data mind-set. Of course, this isn’t just about adopting Big Data tools like Hadoop. Rather, in all likelihood, the products, services, and overall business strategy should require more powerful tools than relational databases. If you’re thinking big, don’t let Small Data constrain you.
What say you?