Demystifying de-duplication

22.02.2007

Another characteristic that distinguishes target de-dupe products is when de-duplication processing occurs. Finding commonality in the data being backed up takes compute time. To minimize the effect on backup performance, some vendors de-dupe data in the background: their products buffer the backup stream to disk and reduce its size via de-duplication after the fact. ExaGrid, FalconStor and Quantum provide target de-dupe products that do background data de-duplication.

Other products can handle the backup stream and de-dupe in band, in real time. Target vendors that de-dupe in band include Data Domain and Diligent. All source vendors de-dupe in band as well. Paradoxically, the in-band vendors are able to sustain full backup stream performance.
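
To make the timing difference concrete, here is a minimal Python sketch of the two approaches. It is a toy model, not any vendor's design: the names ChunkStore, in_band and background are hypothetical, chunks are assumed to arrive already split, and "disk" is just an in-memory dictionary. It only shows that a background product must hold the raw stream before reducing it, while an in-band product never stores a duplicate chunk.

```python
import hashlib


class ChunkStore:
    """Toy store that keeps one copy of each unique chunk, keyed by SHA-256."""

    def __init__(self):
        self.unique = {}

    def add(self, chunk: bytes) -> None:
        # setdefault keeps the first copy and ignores later duplicates.
        self.unique.setdefault(hashlib.sha256(chunk).digest(), chunk)

    def size(self) -> int:
        return sum(len(c) for c in self.unique.values())


def in_band(stream, store: ChunkStore) -> int:
    """De-dupe each chunk as it arrives; return the largest store size seen."""
    peak = 0
    for chunk in stream:
        store.add(chunk)
        peak = max(peak, store.size())
    return peak


def background(stream, store: ChunkStore) -> int:
    """Land the raw stream first, de-dupe afterwards.

    Returns the bytes held once the raw stream has landed, before reduction
    (a simplification: staging space is assumed to be freed only at the end).
    """
    staged = list(stream)  # raw backup buffered to "disk"
    footprint = store.size() + sum(len(c) for c in staged)
    for chunk in staged:
        store.add(chunk)
    return footprint


if __name__ == "__main__":
    backup = [b"A" * 8, b"B" * 8, b"A" * 8, b"A" * 8]
    print("in band, peak bytes:", in_band(backup, ChunkStore()))
    print("background, staged bytes:", background(backup, ChunkStore()))
```

Both paths end with the same unique chunks; what differs is the work done on the backup path and the space needed before the reduction happens.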

The unit of de-dupe granularity, called chunk size, further differentiates de-dupe products. Only NetApp touts a fixed chunk size equal to data block size, according to Schulz. Most de-dupe products claim variable chunk size from the file level down to sub-block level. With variable chunk sizes, when data is inserted into a file, only the changed data needs to be stored as new; the rest of the file is recognized as unchanged. Even more impressively, most de-dupe products can not only reduce data from generations of the same file but also eliminate data copies across files.
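
The effect of an insertion on fixed versus variable chunks can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how any product named here works: the variable chunker uses content-defined boundaries (cut wherever a checksum of the last few bytes matches a pattern), which is one common way to get variable chunk sizes, and the function names, window, mask and sizes are all hypothetical choices.

```python
import hashlib
import random
import zlib


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Split data at fixed offsets, one chunk per `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16,
                           mask: int = 0x3F, min_size: int = 16) -> list:
    """Cut after any byte where a checksum of the last `window` bytes has
    its low bits all zero, so boundaries follow content rather than offsets."""
    chunks, start = [], 0
    for end in range(1, len(data) + 1):
        if end - start < min_size:
            continue
        # Recomputed at every position for clarity; assumed illustrative only.
        if zlib.crc32(data[max(0, end - window):end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_chunks(old: bytes, new: bytes, chunker) -> int:
    """Count distinct chunks of `new` that already exist in `old`."""
    old_ids = {hashlib.sha256(c).digest() for c in chunker(old)}
    new_ids = {hashlib.sha256(c).digest() for c in chunker(new)}
    return len(old_ids & new_ids)


def demo():
    """Insert three bytes near the start of a file and compare both chunkers."""
    rng = random.Random(2007)
    original = bytes(rng.getrandbits(8) for _ in range(4096))
    edited = original[:100] + b"NEW" + original[100:]
    return (shared_chunks(original, edited, fixed_chunks),
            shared_chunks(original, edited, content_defined_chunks))


if __name__ == "__main__":
    fixed, variable = demo()
    print("chunks reused with fixed size:", fixed)
    print("chunks reused with content-defined size:", variable)
```

With fixed offsets, the insertion shifts every later chunk, so almost nothing after the edit matches the earlier copy. With content-defined boundaries, the cut points move with the data, so chunks after the edit line up again and can be reused.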

Yet another de-dupe difference is file-type sensitivity. Some target products open the backup stream to determine data type and invoke file-type-specific policies to provide better de-duplication. Sepaton claims to have the most file-specific policies. Other target vendors claim that their products perform file-specific de-dupe to a lesser extent or not at all; Data Domain, Diligent and Quantum all claim to be completely agnostic regarding the backup stream.

Reported de-dupe compression ratios range anywhere from 3:1 to 500:1. De-dupe products can sustain these high ratios because backups generate duplicate data every time a full backup is run. Moreover, below the file level, most of the data in a modified file is unchanged and therefore not unique. But some data does not de-dupe well, including audio, photo, movie and other media files that simply don't have excess white space or duplicate data.
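
A back-of-the-envelope model shows why repeated full backups drive the ratio up. The Python sketch below is a simplification with assumed inputs, not vendor data: it supposes every full backup is the same size, that a fixed fraction of the data changes between fulls, that changed data is entirely new, and that unchanged data de-dupes perfectly. The function name dedupe_ratio and the sample figures are hypothetical.

```python
def dedupe_ratio(full_backups: int, change_rate: float) -> float:
    """Ratio of data sent to data stored for a series of full backups.

    full_backups: number of retained full backups (at least 1).
    change_rate:  assumed fraction of data that is new in each later full.
    """
    if full_backups < 1 or not 0.0 <= change_rate <= 1.0:
        raise ValueError("need full_backups >= 1 and 0 <= change_rate <= 1")
    sent = float(full_backups)  # in units of one full backup
    stored = 1.0 + (full_backups - 1) * change_rate  # first full plus changes
    return sent / stored


if __name__ == "__main__":
    for fulls, rate in [(20, 0.02), (20, 0.10), (20, 1.0)]:
        print(f"{fulls} fulls, {rate:.0%} change: "
              f"{dedupe_ratio(fulls, rate):.1f}:1")
```

Under these assumptions, 20 retained fulls with 2% change between them work out to roughly 14.5:1, while data that is entirely new each time, which is how media files tend to behave, stays at 1:1 no matter how many backups are kept.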