Large Data Set Analysis in the Cloud: Hadoop gets a boost

09.04.2009

When it comes time to mine the data, the programmer can ignore the details of how the data is laid out. A single function reads through the overall data set and outputs an aggregation organized as key/value pairs; this is known as the map function. A second function, known as the reduce function, then goes through the aggregation output by the map function and selects the desired data, outputting it to a temporary file, organizing it into a table in memory, or even loading it into a data mart to be analyzed with traditional BI tools.
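To make the division of labor concrete, here is a minimal sketch of the two functions as they might be written for Hadoop's streaming interface, which lets scripts in any language read records from standard input and emit tab-separated key/value pairs. The word-count task and the file names (mapper.py, reducer.py) are illustrative assumptions, not anything described above.

    # mapper.py -- the map function: read raw records, emit key/value pairs
    import sys

    for line in sys.stdin:
        for word in line.split():
            # One key/value pair per word; Hadoop sorts and groups these
            # by key before handing them to the reduce step.
            print(word + "\t1")

    # reducer.py -- the reduce function: aggregate the values for each key
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(current_word + "\t" + str(count))  # finished key
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(current_word + "\t" + str(count))

Hadoop's streaming jar would then run these two scripts across the cluster, handling the data distribution, sorting, and scheduling for the programmer.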

The advantage of this approach is that very large sets of data can be managed and processed in parallel across the machine pool managed by Hadoop. The map/reduce approach is sometimes criticized as inefficient, since the overall data pool is processed each time a new analysis is desired. While it's true that repeated processing is typical, it's also true that in today's world it's impossible to know beforehand what kinds of analyses will be wanted in the future. Optimizing away the "inefficient" repeated processing would therefore also limit the ability to explore unforeseen analytical requirements as they arise. Moreover, processing is significantly cheaper than data storage, relatively speaking. A general rule of thumb is to economize on the least efficient, most costly resource; with Internet-scale data, that means "wasting" processing in order to optimize storage.

While Hadoop may seem like a product you have no need for, don't dismiss it. The changing nature of IT, as well as the rapid evolution of business processes, means that you'll likely face the need for this kind of analytical tool in the very near future.

The power of Hadoop can be seen in how one organization used it to convert a 4 TB collection of its pages from one format to another. The programmer assigned the task uploaded the data to a number of EC2 instances and then ran a Hadoop map/reduce transformation on the pages. Two days later, all of the pages had been converted, the work sped up by the parallel processing across 20 machines that Hadoop made possible. Attempting the same conversion on one machine would have taken well over a month; attempting the parallel processing without Hadoop would have required building a large, complex grid fabric. Hadoop abstracted all the "plumbing" (spreading the data across machines, coordinating the parallel processing, ensuring that the job was executing properly, and so on) away from the programmer, letting him focus on the actual task: writing the document conversion routine executed in the reduce phase of the process.
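As a rough illustration of where that conversion logic lives, the reduce step of such a job might look something like the following streaming-style sketch. The record format (key = page identifier, value = path to the source file) and the convert_page helper are assumptions made purely for illustration; the actual job described above was certainly more involved.

    # convert_reducer.py -- illustrative reduce step: convert each page
    # from its source format to the target format
    import sys

    def convert_page(source_path):
        # Hypothetical stand-in for the real conversion routine,
        # which would render the source file and write it out in
        # the new format. Returns the path of the converted file.
        target_path = source_path + ".converted"
        # ... real conversion work would happen here ...
        return target_path

    for line in sys.stdin:
        page_id, source_path = line.rstrip("\n").split("\t", 1)
        target_path = convert_page(source_path)
        # Emit a record of what was converted so the job's output
        # doubles as a manifest of the new files.
        print(page_id + "\t" + target_path)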

Hadoop has attracted vendor attention. A new startup, Cloudera, offers certified releases and regular updates, as well as technical support. (Incidentally, Cloudera also offers a set of online training videos, which are excellent.) And just last week, Amazon announced it will offer Hadoop support in its Elastic MapReduce service, making Hadoop available in the cloud.