Guidelines for determining backup health

30.01.2007
In previous columns I've emphasized the need for backup reporting and metrics to ensure that data is protected appropriately. However, even with regular reports showing successful backups, the devil is in the details. It is important to go beyond a raw statistic, such as the percentage of successes or failures, and analyze what it actually means. To that end, here are three fundamental guidelines to apply when attempting to determine backup health.

1. A 1 percent failure rate out of a million backup jobs is still 10,000 failures -- A high backup success rate doesn't guarantee a risk-free environment. Even one failure can have serious repercussions if, for example, it means that a critical application is at risk. All backup failures must be understood and remedied, so don't be lulled into a false sense of security by the numbers.

2. Not all errors are created equal -- Backup jobs fail for a wide variety of reasons. Some failures represent serious issues relating to improper configuration, operational flaws or resource contention. Others, however, can be attributed to "housekeeping" issues, such as attempting to back up retired hosts, or to subtle operational quirks of some backup applications. These nuisance errors can often get in the way of resolving "real errors."

3. Partial successes are also partial failures -- One of the most common self-induced risks in backup environments is ignoring partial backups. These are backup jobs that are reported as complete even though some files could not be backed up -- most commonly because the files were held open and locked by another application. Often, these are temporary files that do not need to be backed up. The problem is that without analysis, there is no way to determine whether this is truly the case or whether, perhaps, a critical database file is not being properly backed up.
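
As a rough illustration of the three guidelines, the sketch below sorts job results into buckets instead of reducing them to a single success percentage. Everything in it is an assumption made for the example: the `Job` record layout, the `RETIRED_HOSTS` set, the `CRITICAL_SUFFIXES` tuple and the host names are hypothetical, and real backup applications report this information in their own formats.

```python
from dataclasses import dataclass, field

# Hypothetical record layout; not taken from any particular backup product.
@dataclass
class Job:
    host: str
    status: str  # "success", "partial", or "failed"
    skipped_files: list = field(default_factory=list)

# Assumed inputs: hosts known to be retired, and file suffixes treated as critical.
RETIRED_HOSTS = {"old-web01"}
CRITICAL_SUFFIXES = (".mdf", ".dbf")

def triage(jobs):
    """Sort jobs into buckets; anything other than "success" is treated as a failure."""
    report = {"clean": 0, "housekeeping": [], "needs_review": [], "critical": []}
    for job in jobs:
        if job.status == "success":
            report["clean"] += 1
        elif job.host in RETIRED_HOSTS:
            # Housekeeping: fix the schedule rather than investigate the job.
            report["housekeeping"].append(job)
        elif any(f.lower().endswith(CRITICAL_SUFFIXES) for f in job.skipped_files):
            # A partial backup that skipped what looks like a database file.
            report["critical"].append(job)
        else:
            # Remaining failures and partials still need a human to look at them.
            report["needs_review"].append(job)
    return report
```

Note that a partial job is never counted as clean here; at best it lands in the review bucket until someone confirms that the skipped files really were disposable.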

Organizations get into trouble because they lack the time and resources to properly analyze and address these failures. The more complex problems often require the involvement of cross-functional personnel. This is why it is so critical to identify and eliminate repeated nuisance errors and partial backups. Otherwise, given these constraints, organizations run the risk of losing sight of truly critical errors among the noise.
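
One simple way to surface repeated nuisance errors is to count how often the same host and error pair recurs across reporting periods. The following is a minimal sketch under stated assumptions: the `(host, error)` tuple format, the function name `repeated_errors` and the `min_count` threshold are all hypothetical choices for the example, not features of any backup product.

```python
from collections import Counter

def repeated_errors(failures, min_count=3):
    """Return (host, error) pairs seen at least min_count times, most frequent first.

    failures: iterable of (host, error_message) tuples collected from past runs
    (an assumed format for this example).
    """
    counts = Counter(failures)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]
```

Pairs at the top of the list are candidates for a one-time fix, such as removing a retired host from the schedule or excluding a known temporary directory, so that they stop masking the less frequent errors that matter.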

Jim Damoulakis is chief technology officer at GlassHouse Technologies Inc., a leading provider of independent storage services. He can be reached at jimd@glasshouse.com.