Vendors' hard drive failure rates: Myth or metric?
The statistics of mean time between failures (MTBF) and annualized failure rate (AFR) have received a lot of attention lately in the storage world.
By Mary Brandel, Computerworld | Computerworld US | Published 06:00, 26 April 08
There's also the idea of collecting and using field data. But even that's problematic, Schroeder says. For one thing, the study showed disk failures increasing significantly as the disks aged. So data gathered from disks that are, say, one year old would not apply to disks of a different age, she points out.
"Maybe you'd need an AFR for each year that the drive is in use, but even so, there are so many other factors, such as operating conditions and workload," Schroeder says. "There are so many factors that impact drive reliability that it would be hard to come up with a realistic model." However, she agrees that it would be helpful for vendors to provide more data, such as field replacement rates and number of latent sector errors. Carnegie Mellon is working with Usenix to create a failure data repository for drives of various types, ages and capacities.
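To make the idea of "an AFR for each year that the drive is in use" concrete, here is a minimal sketch of how field replacement counts might be turned into a per-age rate. The function name, the record layout and every number in the sample are hypothetical and purely illustrative; they are not drawn from the Carnegie Mellon or Google studies, and the sketch ignores the operating-condition and workload factors Schroeder mentions.

```python
from collections import defaultdict


def afr_by_age(records):
    """Return {age_in_years: replacements per drive-year of exposure}.

    records is an iterable of hypothetical tuples:
    (age_in_years, drive_years_observed, replacements).
    """
    exposure = defaultdict(float)
    failures = defaultdict(int)
    for age, drive_years, replaced in records:
        exposure[age] += drive_years
        failures[age] += replaced
    # Skip age buckets with no observed exposure to avoid dividing by zero.
    return {
        age: failures[age] / exposure[age]
        for age in sorted(exposure)
        if exposure[age] > 0
    }


# Made-up sample data, for illustration only.
sample = [
    (1, 1000.0, 20),
    (2, 900.0, 36),
    (3, 800.0, 48),
    (1, 500.0, 10),
]

rates = afr_by_age(sample)
for age, rate in rates.items():
    print(f"year {age}: {rate:.1%}")
```

In this invented sample the rate climbs with age, which is the pattern that would make a single one-year figure a poor guide for older drives.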
What affects customers even more than disk reliability, Smith says, is batches of bad drives, especially in light of the Google study's finding that one drive failure correlates highly with other failures.
"That's why it's so important that vendors put an infrastructure in place that allows them to do physical analysis of the installed base to find out if there's a systemic problem - a microcode bug or a bad batch of components," Zaffos says. "That's what makes it important to have a mature service/support organisation that is able to track history and look for patterns."
Is MTBF relevant?
Have we reached a point where the disk drives are so reliable that we don't need to concern ourselves with numbers like MTBF or AFR? "I don't know the answer," Smith says. But he acknowledges that he himself doesn't look carefully at MTBF on drives before he buys them.
"I believe down to my bone marrow that the MTBF on these drives is so high that I don't have to worry about it much," Smith says. "Do you make a distinction between someone's numbers that are 10 percent different? What's the difference between a million hours and 1.5 million?"
At the same time, the more the industry learns about what makes one storage system more reliable than another, the better these systems can be architected. Reliability has greatly improved since the days when vendors really did have to prove that their disks were trustworthy, but the degree to which businesses rely on these components has also increased dramatically. Twenty-five years ago, a 25,000-hour MTBF was seen as pretty good. Compared with today's million-hour numbers, "the trouble seems to be growing faster than their reliability," Harris says.
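For readers who want to put Smith's question in numbers, the sketch below converts an MTBF figure into an approximate annualized failure rate. It rests on two assumptions that are not from the article: drives run around the clock (8,760 hours a year), and failures follow a constant-rate exponential model. That model is a textbook simplification, and the field studies discussed above suggest real drives do not follow it closely. The function name is hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # assumes 24x7 operation


def mtbf_to_afr(mtbf_hours):
    """Approximate chance that a drive fails within one year.

    Assumes a constant failure rate (exponential model); this is a
    simplification, not a vendor's or the article's stated method.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


# MTBF figures mentioned in the article.
for mtbf in (25_000, 1_000_000, 1_500_000):
    print(f"MTBF {mtbf:>9,} hours -> AFR about {mtbf_to_afr(mtbf):.2%}")
```

Under these assumptions, a million-hour MTBF works out to an AFR of a little under 0.9 percent and 1.5 million hours to about 0.6 percent, while the 25,000-hour figure of 25 years ago implies a rate close to 30 percent.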