Vendors' hard drive failure rates: Myth or metric?

The statistics of mean time between failures (MTBF) and annualized failure rate (AFR) have received a lot of attention lately in the storage world.

This finding is important, the study's authors say, because it will lead the industry to consider other factors when designing reliable storage systems. Such factors include selecting more reliable disk models and shelf enclosures, employing redundancy mechanisms such as multipathing to tolerate component failures, and configuring the storage subsystem with two independent interconnects rather than a single one.


Other statistics

Yet another reliability statistic that's bandied about is mean time to data loss (MTDL), a measurement derived from MTBF. MTDL is used by vendors of storage subsystems, not drive manufacturers, and it takes into account the number of disks involved, resiliency, rebuild time and the amount and type of redundancy offered.

This is a useful number, Harris says, but it's still based more on theory than actuality. In fact, he says, the University of Illinois study calls one of MTDL's tenets into question. It found that each type of storage subsystem failure exhibits strong correlations; that is, after one failure, the probability of additional failures of the same type is higher, and the failures are likely to happen relatively close together in time.

"Most of the theoretical numbers that people use for mean time to data loss are based on the idea that failures are random, but they aren't," Harris says. "The failures are fairly highly correlated, so the theoretical calculation doesn't match what's observed in the field."
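To see why the independence assumption matters, here is a minimal sketch of the kind of theoretical calculation Harris is describing. It uses a common textbook approximation for a single-parity disk group, which is not necessarily the formula any particular vendor uses. The function name and the input values (a 1,000,000-hour MTBF, eight disks, a 24-hour rebuild) are illustrative assumptions, not figures from the studies.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def mtdl_single_parity_hours(mtbf_hours, n_disks, rebuild_hours):
    """Approximate MTDL (in hours) for a group that survives one disk failure.

    Textbook approximation: MTDL ~= MTBF^2 / (N * (N - 1) * rebuild time).
    It ASSUMES failures are independent and occur at a constant rate,
    which is exactly the assumption the field data calls into question.
    """
    if n_disks < 2:
        raise ValueError("a single-parity group needs at least two disks")
    if mtbf_hours <= 0 or rebuild_hours <= 0:
        raise ValueError("MTBF and rebuild time must be positive")
    return mtbf_hours ** 2 / (n_disks * (n_disks - 1) * rebuild_hours)


if __name__ == "__main__":
    # Hypothetical inputs chosen only for illustration.
    mtdl = mtdl_single_parity_hours(mtbf_hours=1_000_000, n_disks=8, rebuild_hours=24)
    print(f"Theoretical MTDL: {mtdl / HOURS_PER_YEAR:,.0f} years")
```

The result comes out in the tens of thousands of years, which is the point: the number is only as good as the independence assumption behind it. If a second failure becomes more likely right after the first, as the study observed, the real exposure during a rebuild is higher than this formula suggests.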

All those numbers aside

Perhaps the issue that the industry is having the most difficulty communicating, Harris says, is that disk drives are mechanical devices, and as such, they wear out. In fact, looking at the Google and Carnegie Mellon studies, once a drive reaches three years of age, its AFR starts rising, he says.

"This is something people with big disk farms have known intuitively for years," Harris says. "If you're just buying a couple hundred disk drives, the number that's important to you is that after about three years, you should be thinking about replacing your drives," depending on how risk-averse you are, how valuable your data is and how good your backups are.

The trouble is, vendor marketing teams need to figure out how to convey that. "It's not an easy problem from a perception standpoint," Harris says.

The three-year mark closely correlates with typical disk warranties, Smith points out. "Although MTBF is supposed to be 136 years, that's way past the warranty on these drives," he says.
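The gap between a 136-year MTBF and a three-year service life is easier to see when MTBF is converted into a yearly failure probability. The sketch below assumes a constant failure rate (an exponential lifetime model), which is a simplification and is the same assumption the field studies dispute for older drives. The function names are illustrative.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def mtbf_hours_to_years(mtbf_hours):
    """Express an MTBF given in hours as years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def afr_from_mtbf_hours(mtbf_hours):
    """Yearly failure probability implied by an MTBF.

    ASSUMES a constant failure rate: AFR = 1 - exp(-8760 / MTBF).
    It says nothing about how long an individual drive will last.
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    mtbf_hours = 136 * HOURS_PER_YEAR  # the 136-year figure Smith cites
    print(f"MTBF in years: {mtbf_hours_to_years(mtbf_hours):.0f}")
    print(f"Implied AFR:   {afr_from_mtbf_hours(mtbf_hours):.2%}")
```

Under this model, a 136-year MTBF implies that roughly 0.7% of a large population of drives fails each year. It is a statement about a fleet during its service life, not a promise that any one drive will run for 136 years.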

Some say the only reason the measurement hasn't been ditched yet is that there's no good alternative. "Should we throw out MTBF? Well, what are we going to replace it with?" Smith asks. There's got to be some way, he says, to show growth in reliability and to distinguish among drives intended to be more reliable than others.

"I think most people would like to have a different measure, but it's hard to come up with what a better metric should be," Schroeder notes.

Smith claims that vendors do have unambiguous MTBF numbers, but they choose not to share them. "The people that know the most about this are the biggest disk array manufacturers," Smith says. "But I don't think they're going to be forthright with giving people that data because it would reduce the opportunity for them to add value by 'interpreting' the number."

Comments

  • Tom: A million hours is 114 years. Does that mean that half of the drives made before 1897 are still working?