Large Data Set Analysis in the Cloud: Hadoop gets a boost

09.04.2009
Traditional business intelligence solutions can't scale to the degree necessary in today's data environment. One solution getting a lot of attention recently: Hadoop, an open-source product inspired by Google's search architecture.

Twenty years ago, most companies' data came from fundamental transaction systems: payroll, ERP, and so on. The amounts of data seemed large, but were usually bounded by well-understood limits: the overall growth of the company and the growth of the general economy. For companies that wanted to gain more insight from those systems' data, the related data warehousing systems reflected the underlying systems' structure: regular data schemas, smooth growth, well-understood analysis needs. The typical business intelligence constraint was the amount of processing power that could be applied. Consequently, a great deal of effort went into data design to keep the required processing within the available power. This led to the now time-honored business intelligence data warehouses: fact tables, dimension tables, star schemas.

Today, the nature of business intelligence has changed completely. Computing is far more widespread throughout the enterprise, leading to many more systems generating data. Companies are on the Internet, generating huge torrents of unstructured data: searches, clickstreams, interactions, and the like. And it's much harder, if not impossible, to forecast what kinds of analytics a company might want to pursue.

Today it might be clickstream patterns through the company website. Tomorrow it might be cross-correlating external blog postings with order patterns. The day after it might be something completely different. And the system bottleneck has shifted. While in the past the problem was how much processing power was available, today the problem is how much data needs to be analyzed. At Internet scale, a company might be dealing with dozens or hundreds of terabytes. At that size, the number of drives required to hold the data guarantees frequent drive failures. And attempting to centralize the data generates too much network traffic to conveniently move it to the processors.

One thing is clear: the traditional business intelligence solutions can't scale to the degree necessary in today's data environment.

Fortunately, several solutions have been developed. One, in particular, has gotten a lot of attention recently: Hadoop. Essentially, Hadoop is an open-source product inspired by Google's search architecture. Interestingly, unlike previous open-source products, which were usually implementations of existing proprietary products, Hadoop has no proprietary predecessor. The innovation in this aspect of big data resides in the open-source community, not in a private company.

Hadoop creates a pool of computers, each running a special Hadoop file system. A central master Hadoop node spreads data across the machines in a file structure designed for large block reads and writes. It uses a clever hash algorithm to cluster similar data elements, making processing of large data sets extremely efficient. For robustness, three copies of all data are kept, ensuring that hardware failures do not halt processing.
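
To make that concrete, here is a minimal sketch of what writing a file into that distributed file system looks like from a client program, using the Hadoop Java API of this era. The master-node address, file path, and sample record are hypothetical placeholders, and the three-replica setting shown is simply the default behavior described above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's central master node.
        // (hdfs://namenode.example.com:9000 is a placeholder address.)
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000");

        // Keep three replicas of each block -- the safeguard against
        // the frequent drive failures described above.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);

        // Write a file; the file system splits it into large blocks and
        // spreads the replicas across the machines in the pool.
        Path path = new Path("/analytics/clickstream-sample.log");
        FSDataOutputStream out = fs.create(path);
        out.writeUTF("2009-04-09 /products/widget-42 GET 200\n");
        out.close();

        fs.close();
    }
}
```

The point of the sketch is that the application never chooses which machine holds the data; it simply writes to a path, and the master node handles block placement and replication behind the scenes.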