Vendors' hard drive failure rates: Myth or metric?
The statistics of mean time between failures (MTBF) and annualized failure rate (AFR) have received a lot of attention lately in the storage world.
By Mary Brandel, Computerworld | Computerworld US | Published 06:00, 26 April 08
According to Harris, the industry has tried to be less misleading by using AFR instead of MTBF. "People want to know, in a given year, what percentage of drives they can expect to fail," says Bianca Schroeder, a co-author of the Carnegie Mellon study.
However, according to the study, the rate of disk replacements is far higher than the AFR percentages provided by vendors. While vendors' data sheets show AFRs between 0.58 and 0.88 percent, the study found average replacement rates typically exceeding 1 percent, with 2 to 4 percent common and up to 13 percent observed on some systems. The study gathered the disk-replacement data of a number of large production systems, for a total of 100,000 SCSI, Fibre Channel and SATA disks.
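To make the gap concrete, the arithmetic can be sketched in a few lines of Python. This is an illustration only, not the study's method: it uses the common first-order approximation that AFR is roughly the hours in a year divided by MTBF, and the MTBF ratings of 1,000,000 and 1,500,000 hours are assumed values chosen because they yield roughly the 0.88 and 0.58 percent data-sheet figures quoted above. The function names and the 3 percent field rate (the midpoint of the "2 to 4 percent common" range) are likewise hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; assumes drives run around the clock


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Approximate annualized failure rate (as a fraction) implied by an MTBF.

    First-order approximation: AFR ~= hours per year / MTBF.
    It assumes a constant failure rate and an MTBF much larger than one year.
    """
    return HOURS_PER_YEAR / mtbf_hours


def expected_replacements(disk_count: int, annual_rate: float) -> float:
    """Expected number of drives replaced per year at a given annual rate."""
    return disk_count * annual_rate


if __name__ == "__main__":
    # Assumed data-sheet MTBF ratings (not taken from the article text).
    for mtbf in (1_000_000, 1_500_000):
        print(f"MTBF {mtbf:,} h -> AFR about {afr_from_mtbf(mtbf):.2%}")

    # Population size from the study; 3% is an assumed mid-range field rate.
    disks = 100_000
    for label, rate in (("data-sheet AFR", afr_from_mtbf(1_000_000)),
                        ("observed replacement rate", 0.03)):
        print(f"{label}: about {expected_replacements(disks, rate):,.0f} "
              f"drives per year out of {disks:,}")
```

Under these assumptions, a population the size of the one studied would expect fewer than 1,000 failures a year at the data-sheet rate, but about 3,000 replacements a year at a 3 percent field rate.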
The study also found that replacement rates grew constantly with age, which counters the common understanding that drive degradation sets in only after a nominal lifetime of five years, Schroeder says.
One explanation for this is that the study looked at how many drives were replaced, while AFR describes how many disks are expected to fail, and as Schroeder explains, "failure and replacement are not always the same thing." For one thing, users might proactively replace a drive that's just starting to act strange but is actually still functional.
A second factor lies in the vendors' testing environments, which are optimal compared with actual user environments, Schroeder says. Zaffos points out that there are lots of transient events that could cause a failure in a user's environment, including vibration, power surges, dust and humidity levels.
A third disconnect lies in the definition of a failure. "Vendors define failure differently than you and I do, and their definition makes drives look more reliable than what you and I see," Harris says on his blog. That's because when disk drive vendors get a drive returned to them marked "bad," they plug it into their test equipment, and if they find no problem with it, they dub it "no trouble found," or NTF, which raises the reported reliability figures.
In fact, vendors typically report "no trouble found" for 20 to 30 percent of all returned drives, he says. But, Harris says, you can take that same disk, plug it back into the user's server, and it still won't work. To the user, it remains a bad disk that needs to be replaced. Both versions of the truth can be valid at the same time.
As the Google study found, disk failures are sometimes the result of a combination of components, like a particular drive with a particular controller or cable. "A good number of drives could be still considered operational in a different test harness," the study says. "We have observed situations where a drive tester consistently 'green lights' a unit that invariably fails in the field."
The University of Illinois study supports that finding. Although disks account for 20 to 55 percent of storage subsystem failures, other components such as physical interconnects and protocol stacks also account for significant percentages - 27 to 68 percent for physical interconnects and 5 to 10 percent for protocol stacks. The study used real-world field data from NetApp, analysing the error logs collected from about 39,000 commercially deployed storage systems. The data set included 1.8 million disks hosted in about 155,000 storage-shelf enclosures.











