Hadoop and the Enterprise Debates
From MIKE2.0 Methodology
Hadoop is one of the “big 3” topics that must come up – either by default or from my questioning – in every analyst briefing I’m on. The other two, for the record, are cloud and mobile (where applicable). I hardly have to ask about the Hadoop strategy. Vendors are flocking to it so as not to get left behind.
Meanwhile, in shops, Hadoop projects are still skunkworks. This is not to say they are not succeeding at what they set out to accomplish. Many are. However, keeping the mystery technology away from the prying eyes of IT avoids debates like these:
1. Relational databases are getting better every year and will eventually have the capability to do everything Hadoop does
2. We have no staff that knows Hadoop
3. You can’t run SQL against Hadoop
4. I thought we solved “big data” with our data warehouse
5. Most of the weblogs and other unstructured data you would store in Hadoop are not very valuable
6. We run a real-time business and you cannot run jobs in real-time in Hadoop
I’ll address each of these points here to demonstrate that Hadoop does have a place and, despite the hype, remains underrated.
1. While it’s true that relational databases are getting better every year, it’s also true that most of what is put into Hadoop could, with enough force, be done in a relational database today. Some relational data warehouses exceed a petabyte. It’s just that it would be very expensive to do it that way. For the most part, the unstructured data that shops finally decide to put into Hadoop is not under management today except in small pockets.
2. Investment in training selective staff, utilizing consultants who already know Hadoop and strategists who can effectively architect the seams between Hadoop and the rest of the environment will be necessary. Any new technology must clear this hurdle.
3. It’s true, but Pig’s language is SQL-like and should not be too difficult for an experienced SQL developer to pick up. If the argument is that the query languages are not as rich as SQL, that is going to be true for a while, but Hadoop workloads are currently more data-intensive than query-intensive. Plus, these tools are being aggressively developed.
3a. For a SQL interface to Hadoop, take a look at the Apache Hive project. Hive comes pre-installed in the Cloudera packages and is a very popular way to access data in HDFS.
4. We solved the “big data” of its time with data warehousing, but data only got bigger and bigger, with many more streams arriving. Hadoop is usually not a cost-competitive replacement for alphanumeric big data. The unstructured data we’ve been leaving behind is quite big too, and that is what Hadoop is for.
5. Every piece of data, if harnessed, adds value to organizational knowledge. Given the size of this data, the value per unit of data is going to be far less than that of the alphanumeric data in the relational data warehouse. However, if there’s positive return on investment and competitive parity or advantage to be had from managing the unstructured data, that too must be done.
6. This argument is completely valid. Jobs against Hadoop data are almost exclusively batch. Even the smallest job will take several minutes. By the nature of the typical job profile, running against massive amounts of data, both in terms of transactions and the accompanying customer and other master data, jobs will take a long time. These jobs would take a long time in a SQL/relational database as well.
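To make points 3 and 6 concrete, consider what even a simple GROUP BY-style count looks like when written as raw MapReduce rather than SQL. Below is a minimal sketch in Hadoop Streaming style (a mapper and reducer communicating through tab-separated text, as the real framework does over stdin/stdout); the weblog format and field position are assumptions for illustration only.

```python
from itertools import groupby

def mapper(lines):
    # Emit (key, 1) for each record -- here, the first
    # whitespace-separated field of a hypothetical weblog line.
    for line in lines:
        fields = line.split()
        if fields:
            yield fields[0], 1

def reducer(pairs):
    # Hadoop sorts mapper output by key before the reduce phase;
    # locally we mimic that shuffle with sorted() + groupby,
    # then sum the counts for each key.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

if __name__ == "__main__":
    # In a real Streaming job the mapper and reducer run as separate
    # processes across the cluster; here we chain them on sample data.
    sample = ["a.com /index", "b.com /about", "a.com /faq"]
    for key, total in reducer(mapper(sample)):
        print(f"{key}\t{total}")
```

The Hive equivalent of all of the above would be roughly one line of HiveQL (a count grouped by the first column), which is the point of 3a: the batch engine underneath is the same, but the SQL-like layer spares developers the plumbing.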
That said, Hadoop will not remain an island; several integration points with the rest of the enterprise are emerging. One is integration with master data management (MDM) data. Eventually richer customer and other master data profiles will need to be integrated with the Hadoop data in order to perform richer analysis. MDM data, at much smaller volume than transactional data, is enterprise-built and much richer than any profile information that could be derived from Hadoop data alone.
Another integration point will be with the data warehouse itself. While naturally we will not port all the information from Hadoop into the data warehouse (or we would have just stored it there to begin with), summary data from Hadoop will be interesting to have in the data warehouse to take advantage of the tools, the other data, and the reporting infrastructure there. Hadoop is doing a lot of cleansing, processing and aggregating data for a data warehouse.
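The rollup step described above can be sketched simply: take the tab-separated output of a Hadoop job and reshape it into rows for a bulk load into a warehouse summary table. The two-column input format and the target schema (summary_date, key, hits) are assumptions for illustration, not any particular vendor's interface.

```python
import csv
import io

def summarize_for_warehouse(hadoop_output_lines, run_date):
    """Roll tab-separated Hadoop job output (key<TAB>count) up into
    CSV rows suitable for bulk-loading a warehouse summary table.
    The input format and target schema are illustrative assumptions."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["summary_date", "key", "hits"])
    for line in hadoop_output_lines:
        key, count = line.rstrip("\n").split("\t")
        writer.writerow([run_date, key, int(count)])
    return buf.getvalue()
```

Only this small summary file crosses into the warehouse; the raw detail stays in Hadoop, which is exactly the division of labor the paragraph above describes.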
Help is on the way. The aforementioned vendor flock now includes Informatica. Version 9.1 of their platform will have a connector to the Hadoop file system (HDFS). This will allow customers of this leading data integration tool to move data in and out of Hadoop clusters and apply their Informatica expertise effectively to Hadoop.
Another vendor, Pentaho, an open source business intelligence company, is integrating its open source ETL tool with Hadoop.
I’ll also mention Talend, an open source integration company, whose vision is to be able to move any data to HDFS. They are connecting their set of 450+ connectors to Hadoop. Talend Hadoop jobs can be run inside or outside of a Hadoop VM with Talend’s “distance run” feature.
There is also a burgeoning marketplace of third-party enterprise Hadoop tools opening up. It will only improve the interactivity between Hadoop and the rest of the enterprise, making organizations more comfortable taking up these projects and minimizing the debates.