system signals could be useful in conjunction with SMART data to create more powerful models. We plan to
explore this possibility in our future work.
Figure 14: Percentage of failed drives with SMART errors.
4 Related Work
Previous studies in this area generally fall into two categories: vendor (disk drive or storage appliance)
technical papers and user experience studies. Disk vendor studies provide valuable insight into the
electromechanical characteristics of disks, along with model-based and experimental data suggesting how
several environmental factors and usage activities can affect device lifetime.
Yang and Sun and Cole describe the processes and experimental setup used by Quantum and
Seagate to test new units, and the models that attempt to make long-term reliability predictions based on
accelerated life tests of small populations. Power-on hours, duty cycle, and temperature are identified as the
key deployment parameters that impact failure rates, each having the potential to double failure rates
when going from nominal to extreme values.
For example, Cole presents thermal de-rating models showing that MTBF could degrade by as much as
50% when going from operating temperatures of 30°C to 40°C. Cole's report also presents yearly failure rates
from Seagate's warranty database, indicating a linear decrease in annual failure rates from 1.2% in the first
year to 0.39% in the third (and last) year of record.
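To make the MTBF figures above easier to compare with annual failure rates, the following is a minimal sketch of the standard conversion under a constant-hazard (exponential lifetime) assumption. The nominal MTBF value is hypothetical and chosen only for illustration; it is not taken from Cole's report, and the exponential model is an assumption rather than the model used in that report.

```python
import math

HOURS_PER_YEAR = 8760  # 24 h * 365 days; assumes the drive is powered on continuously


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate implied by an MTBF, assuming exponential lifetimes."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


# Hypothetical nominal MTBF (illustrative assumption, not a vendor figure).
nominal_mtbf = 1_000_000.0
# A 50% MTBF degradation, as in the 30°C -> 40°C de-rating example.
derated_mtbf = 0.5 * nominal_mtbf

nominal_afr = afr_from_mtbf(nominal_mtbf)
derated_afr = afr_from_mtbf(derated_mtbf)
ratio = derated_afr / nominal_afr

print(f"nominal AFR: {nominal_afr:.4%}")
print(f"derated AFR: {derated_afr:.4%}")
print(f"ratio:       {ratio:.3f}")
```

Under this assumption, halving the MTBF roughly doubles the annualized failure rate, which is consistent with the statement that each deployment parameter can double failure rates at extreme values.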
In our study, we did not find much correlation between failure rate and either elevated temperature or
utilization. This is the most surprising result of our study. Our annualized failure rates were generally higher
than those reported by vendors, and more consistent with other user experience studies.
Shah and Elerath have written several papers based on the behavior of disk drives inside Network
Appliance storage products [6, 7, 19]. They use a reliability database that includes field failure statistics as
well as support logs, and their position as an appliance vendor gives them more control over and visibility into
actual deployments than a typical disk drive vendor might have. Although they do not report directly on the
correlation between SMART parameters or environmental factors and failures (possibly for confidentiality
reasons), their work is useful in enabling a qualitative understanding of the factors that affect disk drive
reliability.
For example, they comment that end user failure rates can be as much as ten times higher than what the
drive manufacturer might expect; they report in  a strong experimental correlation between number of
heads and higher failure rates (an effect that is also predicted by the models in ); and they observe that
different failure mechanisms are at play at different phases of a drive's lifetime.
Generally, our findings are in line with these results. User experience studies may lack the depth of insight
into a device's inner workings that is possible in manufacturer reports, but they are essential in
understanding device behavior in real-world deployments. Unfortunately, there are very few such studies to
date, probably due to the large number of devices needed to observe statistically significant results and the
complex infrastructure required to track failures and their contributing factors.
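The point about population size can be illustrated with a confidence interval on an observed failure rate. The sketch below uses the Wilson score interval for a binomial proportion; the two populations and their failure counts are hypothetical values chosen for illustration, not data from any of the studies discussed here.

```python
import math


def wilson_interval(failures: int, drives: int, z: float = 1.96):
    """Wilson score interval for a failure proportion (z = 1.96 gives ~95%)."""
    p = failures / drives
    denom = 1.0 + z * z / drives
    center = (p + z * z / (2.0 * drives)) / denom
    half = z * math.sqrt(p * (1.0 - p) / drives + z * z / (4.0 * drives * drives)) / denom
    return center - half, center + half


# Hypothetical populations, both with an observed 2% failure rate.
small_lo, small_hi = wilson_interval(failures=10, drives=500)
large_lo, large_hi = wilson_interval(failures=1000, drives=50_000)

print(f"500 drives:    {small_lo:.4f} to {small_hi:.4f}")
print(f"50,000 drives: {large_lo:.4f} to {large_hi:.4f}")
```

With the same observed rate, the interval for the smaller population is roughly ten times wider, so differences between subgroups (for example, by age or temperature) are much harder to distinguish from noise.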
Talagala and Patterson perform a detailed error analysis of 368 SCSI disk drives over an eighteen-month
period, reporting a failure rate of 1.9%. Results on a larger number of desktop-class ATA drives
under deployment at the Internet Archive are presented by Schwarz et al. They report a 2% failure
rate for a population of 2489 disks during 2005, while mentioning that replacement rates have been as high
as 6% in the past. Gray and van Ingen cite observed failure rates ranging from 3.3% to 6% in two large web
properties with 22,400 and 15,805 disks, respectively. A recent study by Schroeder and Gibson helps
shed light on the statistical properties of disk drive failures. The study uses failure data from several
large-scale deployments, including a large number of SATA drives. They report a significant overestimation of