Measuring backup health

If you can't measure it, you can't manage it. Some may argue that there are exceptions to this truism, but backup and recovery is not one of them. Although there is a growing effort to measure backup success rates, that metric alone does not signify a healthy backup environment. Success rate is one important risk indicator, but additional risk metrics, along with efficiency and service-level metrics, are necessary to build a true picture of backup health. Here are a few more metrics to consider:

1. Partial backup completion -- Partial backups should be regarded as failures until their cause is understood, but in many environments they are counted as successes or simply not reported. The usual rationale is that some backup jobs include temporary files that are held open by applications and cannot be backed up successfully. Without a detailed investigation, however, it is impossible to know whether this is really the case. If the temporary files are truly benign, they should be added to an "exclude" list to avoid nuisance messages. Logs and reports filled with messages about partial backups make it harder to identify real problems.

2. Consecutive backup failures -- It's problematic when a system backup fails in a single backup cycle, but it can be disastrous when backups of the same system fail in successive cycles. Each additional failure pushes the most recent recoverable copy further into the past, stretching the effective recovery point well beyond the committed Recovery Point Objective (RPO). This occurs more frequently than one might expect. A good reporting system should flag consecutive backup failures.

3. Media utilization -- This is an important efficiency metric that is often overlooked. Tape media is expensive and in many environments utilization is significantly below 70 percent. In one multipetabyte backup environment, we found that a 10 percent improvement in media utilization would translate into nearly US$400,000 of annual tape savings.

4. Tape drive performance -- Poor tape drive performance not only results in low drive utilization but also increases the risk of backup failure and reduces the usable life of both media and drive mechanics. Modern tape devices are capable of throughput of 40 MB/sec and higher, yet actual rates of less than 10 MB/sec are commonly observed. Many environments are simply unaware of their dismal performance levels and exacerbate the problem by making future design and purchasing decisions based on unrealistic expectations.
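
The two risk metrics above (partial completions and consecutive failures) can be sketched in a few lines, assuming job results can be exported from the backup application in some form. The JobRecord layout, the status strings, and the default threshold of two cycles are assumptions made for illustration; they do not reflect any particular product's report format.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class JobRecord:
    """Hypothetical export format: one row per client per backup cycle."""
    client: str   # system being backed up
    cycle: int    # backup cycle number, increasing over time
    status: str   # assumed values: "success", "partial", "failed"


def strict_success_rate(records):
    """Fraction of jobs that fully succeeded; partials count as failures."""
    if not records:
        return 0.0
    return sum(r.status == "success" for r in records) / len(records)


def consecutive_failures(records):
    """Map each client to its current run of non-successful cycles."""
    streaks = {}
    ordered = sorted(records, key=lambda r: (r.client, r.cycle))
    for client, jobs in groupby(ordered, key=lambda r: r.client):
        streak = 0
        for job in jobs:
            # Any full success resets the run; partials and failures extend it.
            streak = 0 if job.status == "success" else streak + 1
        streaks[client] = streak
    return streaks


def flag_clients(records, threshold=2):
    """Clients whose latest cycles have failed `threshold` or more times in a row."""
    return sorted(c for c, n in consecutive_failures(records).items() if n >= threshold)
```

Counting partials as failures here follows the advice in item 1: they stay in the failure column until someone has investigated them and moved the benign files to an exclude list.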

Unfortunately, these metrics are not easily obtained through the reporting capabilities of traditional backup applications, but tools or services that produce this data may well be worth the investment.
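
Where raw media and drive figures can be extracted, the two efficiency metrics reduce to simple arithmetic. The following is a minimal sketch; the function names, the use of gigabytes per cartridge, and the decimal definition of a megabyte (1,000,000 bytes) are assumptions for illustration only.

```python
def media_utilization(used_gb, capacity_gb):
    """Fraction of total cartridge capacity that actually holds data.

    used_gb and capacity_gb are parallel lists, one entry per cartridge.
    """
    total_capacity = sum(capacity_gb)
    if total_capacity == 0:
        return 0.0
    return sum(used_gb) / total_capacity


def drive_throughput_mb_s(bytes_written, seconds):
    """Average throughput of one drive over a job, in MB/sec (decimal MB)."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return bytes_written / seconds / 1_000_000
```

For example, a job that writes 36,000,000,000 bytes in one hour averages 10 MB/sec, which can then be compared against the rate the drive is capable of.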