What Hadoop can, and can't do

14.06.2012

But this does not mean that Hadoop should replace existing elements within your data center. On the contrary, Hadoop should be integrated with your existing IT infrastructure in order to capitalize on the many streams of data that flow into your organization.

Consider, for instance, a fairly typical non-Hadoop enterprise web site that handles commercial transactions. According to Sarah Sproehnle, Director of Educational Services for Cloudera, the logs from one of their customers' popular sites would undergo an extract, transform, and load (ETL) procedure on a nightly run that could take up to three hours before depositing the data in a data warehouse. At that point, a stored procedure would be kicked off, and after another two hours the cleansed data would finally reside in the data warehouse. The final data set, though, would be only a fifth of its original size -- meaning that any value that might have been gleaned from the discarded portion of the original data was lost.

After Hadoop was integrated into this organization, things improved dramatically in terms of time and effort. Instead of undergoing an ETL operation, the log data from the web servers was sent straight, in its entirety, to HDFS (the Hadoop Distributed File System). From there, the same cleansing procedure was performed on the log data, only now as MapReduce jobs. Once cleansed, the data was sent on to the data warehouse. The whole operation was much faster, thanks to the removal of the separate ETL step and the speed of MapReduce. And all of the original data was still held within Hadoop, ready for any additional questions the site's operators might come up with later.
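As a rough illustration of what such a cleansing pass might look like, here is a minimal map-only job written against Hadoop's standard Java MapReduce API. The nine-field validity check and the input and output paths are hypothetical stand-ins; a real site would apply its own parsing rules:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Map-only job: reads raw log lines from HDFS, drops records that
    // cannot be parsed, and writes the surviving lines back to HDFS.
    public class LogCleanser {

        public static class CleanseMapper
                extends Mapper<Object, Text, NullWritable, Text> {

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString().trim();
                // Hypothetical sanity check: a valid log line in this
                // sketch has at least nine whitespace-separated fields.
                if (line.isEmpty() || line.split("\\s+").length < 9) {
                    return; // discard malformed records
                }
                context.write(NullWritable.get(), new Text(line));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "log cleanse");
            job.setJarByClass(LogCleanser.class);
            job.setMapperClass(CleanseMapper.class);
            job.setNumReduceTasks(0); // no reduce phase: pure filtering
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /logs/raw
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /logs/clean
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Compiled into a jar, the job would be launched with the hadoop jar command against the raw log directory, and because it is map-only it scales out simply by running more map tasks across the cluster.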

This is a critical point to understand about Hadoop: it should never be thought of as a replacement for your existing infrastructure, but rather as a tool to augment your data management and storage capabilities. Using tools like Sqoop, which can pull data from an RDBMS into Hadoop and back, or Flume, which can stream system logs into Hadoop in real time, you can connect your existing systems with Hadoop and have your data processed no matter its size. All you need to do is add nodes to Hadoop to handle the storage and the processing.
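For example, a single table can be pulled into HDFS with one Sqoop command, and the processed results pushed back out with another. The connection strings, credentials, and table names below are hypothetical placeholders:

    # Import the "transactions" table from MySQL into HDFS.
    sqoop import \
        --connect jdbc:mysql://dbhost/webshop \
        --username etl_user -P \
        --table transactions \
        --target-dir /data/raw/transactions

    # Export the cleansed results back to a warehouse-facing database.
    sqoop export \
        --connect jdbc:mysql://dwhost/warehouse \
        --username etl_user -P \
        --table transactions_clean \
        --export-dir /data/clean/transactions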

Required hardware, and costs