What Hadoop can, and can't do

June 14, 2012

Sproehnle also outlined a fairly easy-to-follow rule of thumb for planning your Hadoop capacity. Because Hadoop scales linearly, every node you add increases both your storage and your processing power. That makes planning straightforward.

If your data is growing by 1 TB a month, for instance, here's how to plan: Hadoop replicates data three times, so you will need 3 TB of raw storage to accommodate each new terabyte. Allowing a little extra space for intermediate processing output (Sproehnle estimates 30 percent overhead) puts the actual need at roughly 4 TB that month. If each of your nodes is a machine with four 1 TB drives, that works out to one new node per month.
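To make the arithmetic concrete, here is a minimal Python sketch of that rule of thumb. The growth rate, replication factor, overhead estimate, and per-node capacity are the illustrative figures from the example above, not fixed constants; substitute your own.

```python
import math

# Back-of-the-envelope Hadoop capacity planning, per the rule of
# thumb above. All figures are assumptions for illustration.
MONTHLY_GROWTH_TB = 1.0     # new data arriving per month
REPLICATION_FACTOR = 3      # HDFS default replication
PROCESSING_OVERHEAD = 0.30  # ~30% scratch space for processing output
NODE_CAPACITY_TB = 4.0      # e.g., a machine with four 1 TB drives

def nodes_needed_per_month(growth_tb: float) -> int:
    """Raw storage required for one month's growth, rounded up to whole nodes."""
    raw_tb = growth_tb * REPLICATION_FACTOR * (1 + PROCESSING_OVERHEAD)
    return math.ceil(raw_tb / NODE_CAPACITY_TB)

raw = MONTHLY_GROWTH_TB * REPLICATION_FACTOR * (1 + PROCESSING_OVERHEAD)
print(f"Raw storage needed this month: {raw:.1f} TB")                 # 3.9 TB
print(f"New nodes to add this month: {nodes_needed_per_month(MONTHLY_GROWTH_TB)}")  # 1
```

Running it with these numbers reproduces the figures above: 1 TB of new data becomes about 3.9 TB of raw storage, or one 4 TB node per month.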

The nice thing is that all new nodes are immediately put to use when connected, giving you n times the processing and storage, where n is the number of nodes.

Installing and managing Hadoop nodes is not exactly trivial, but there are many tools out there that can help. Cloudera Manager, Apache Ambari (which is what Hortonworks uses for its management system), and the MapR Control System are all effective Hadoop cluster managers. If you are using a "pure" Apache Hadoop solution, you can also look at any of several third-party Hadoop management systems.

This is just the tip of the iceberg, of course, when it comes to deploying a Hadoop solution for your organization. Perhaps the biggest takeaway is understanding that Hadoop is not meant to replace your current data infrastructure, only to augment it.