How to Implement Next-Generation Storage Infrastructure for Big Data

16.04.2012

Reliability and availability are mission-critical for Shutterfly, which suggests a need for enterprise-class storage. But the company's rapidly inflating storage costs were making commodity systems much more attractive, Day says. As Day and his team investigated technical options for getting Shutterfly's storage costs under control, they became interested in a technology called erasure codes.

Reed-Solomon erasure codes were originally used as forward error correction (FEC) codes for sending data over an unreliable channel, such as transmissions from deep space probes. The technology is also used in CDs and DVDs to cope with impairments on the disc, such as dust and scratches. More recently, several storage vendors have begun incorporating erasure codes into their products. Using erasure codes, a piece of data can be broken into multiple chunks, each useless on its own, and then dispersed to different disk drives or servers. At any time, the data can be fully reassembled from a subset of the chunks, even if multiple chunks have been lost to drive failures. In other words, you don't need to create multiple copies of the data; a single encoded instance can ensure data integrity and availability.
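The "any subset of the chunks" property can be illustrated with a small Python sketch. It is a toy Reed-Solomon-style code and makes several assumptions that are not from the article: the names `encode` and `decode` are hypothetical, the arithmetic is done modulo the prime 257 for readability (production codes typically work in GF(2^8) so that every symbol fits in a byte), and it is not any vendor's actual algorithm. Each block of `k` data bytes defines a polynomial, the `n` stored chunks are values of that polynomial at other points, and any `k` surviving chunks are enough to interpolate the data back.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def _interpolate_at(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    that passes through the given (xi, yi) points, working modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, k: int, n: int) -> dict:
    """Turn data into n chunks, keyed by evaluation point; any k rebuild it."""
    assert 0 < k <= n and k + n <= P
    padded = data + bytes(-len(data) % k)  # zero-pad to a multiple of k
    chunks = {x: [] for x in range(k, k + n)}
    for off in range(0, len(padded), k):
        # The k data bytes are the polynomial's values at x = 0..k-1.
        block = list(enumerate(padded[off:off + k]))
        for x in chunks:
            chunks[x].append(_interpolate_at(block, x))
    return chunks


def decode(available: dict, k: int, length: int) -> bytes:
    """Rebuild the original bytes from any k surviving chunks."""
    chosen = list(available.items())[:k]
    if len(chosen) < k:
        raise ValueError("not enough chunks to rebuild the data")
    out = bytearray()
    for pos in range(len(chosen[0][1])):
        points = [(x, values[pos]) for x, values in chosen]
        out.extend(_interpolate_at(points, j) for j in range(k))
    return bytes(out[:length])


message = b"photo bytes"
chunks = encode(message, k=4, n=6)
# Simulate two failed drives by dropping two of the six chunks.
survivors = {x: c for x, c in chunks.items() if x not in (4, 7)}
assert decode(survivors, k=4, length=len(message)) == message
```

In this sketch six chunks are stored and any four rebuild the message, so two simultaneous losses are tolerated while the stored volume is 1.5 times the original rather than a multiple of full copies.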

One early vendor of an erasure code-based solution is Chicago, Ill.-based Cleversafe. The company has added location information to create what it calls dispersal coding, which lets users store the chunks (or slices, as Cleversafe calls them) in geographically separate places, such as multiple data centers.

Each slice is mathematically useless on its own, making it private and secure. And because the information dispersal technology stores only a single instance of the data, with minimal expansion, rather than multiple copies as with RAID, companies can save up to 90 percent of their storage costs, Cleversafe says.
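The phrase "minimal expansion" can be made concrete with a little arithmetic. The Python sketch below compares the raw capacity needed for full copies with that of a dispersal layout in which any `k` of `n` slices rebuild the data. All of the numbers (100 TB of data, three copies, a 10-of-16 layout) and the function names are hypothetical, chosen only for illustration; they are not from Cleversafe or Shutterfly, and the sketch does not attempt to reproduce the vendor's 90 percent figure, which is a cost claim rather than a capacity calculation.

```python
def expansion_factor(k: int, n: int) -> float:
    """Raw bytes stored per byte of user data when any k of n slices suffice."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return n / k


def raw_capacity_tb(usable_tb: float, factor: float) -> float:
    """Raw capacity required to hold usable_tb at the given expansion factor."""
    return usable_tb * factor


usable_tb = 100   # hypothetical amount of user data
copies = 3        # hypothetical replication policy: three full copies
k, n = 10, 16     # hypothetical dispersal layout: any 10 of 16 slices rebuild

replicated = raw_capacity_tb(usable_tb, copies)
dispersed = raw_capacity_tb(usable_tb, expansion_factor(k, n))

print(f"replication: {replicated:.0f} TB raw")
print(f"dispersal:   {dispersed:.0f} TB raw, tolerating {n - k} lost slices")
```

Under these assumed numbers, the dispersal layout needs 1.6 times the user data in raw capacity instead of 3 times, while still surviving the loss of up to six slices.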