Transient Error Protection
The Smarter Approach to Uptime

Since day one, Stratus engineering and development efforts have been dedicated to making its hardware and software solutions the most reliable and available in the industry. Every Stratus server comes equipped with Continuous Processing features that are the result of more than two decades of availability achievements. Through its Smarter Approach to Uptime, Stratus addresses the many different factors that negatively impact solutions availability. One of the least understood of these is the transient error and the risk it poses to maintaining solutions uptime and data integrity.

Two classes of hardware errors

Computer hardware crashes can generally be attributed to two classes of errors: hard errors and transient errors. While both hard and transient errors usually result in downtime for a standard server and initiate a failover recovery procedure in a cluster, the similarities end there. Hard errors are usually reproducible, consistent and easy to isolate. In contrast, transient errors are unpredictable random events that are virtually impossible to isolate on a conventional server.

Compounding the problem of transient errors is that they can cause silent data corruption that results in the system generating false outputs. The consequences can be severe. Irretrievable loss of critical data, costly solutions downtime, and failure to meet regulatory compliance may all be at stake when silent data corruption goes unchecked.

What causes transient errors?

Transient hardware errors can occur as a result of many different factors:

Technology factors: The huge performance gains exhibited by today's systems can be attributed directly to the increasing complexity of integrated circuits. But, the very characteristics that enable such gains also increase the likelihood of transient errors. An Intel research and development paper states, "Circuit susceptibility to transient error mechanisms is increasing with each process generation. Some are increasing at an exponential rate."¹ Additional factors include the increase in transistor densities, lower operating voltages, and increasing clock frequencies. The Intel paper also cites a paper given at the International Conference on Dependable Systems and Networks in 2002. "It is predicted that by the year 2011, the error rates in combinational logic will reach the levels at which we had to have protection in memory."²

Environmental conditions: Natural or man-made radiation and changes in temperature, altitude, and humidity can all cause transient errors.

Design issues: The typical industry-standard server is designed with price/performance as its primary goal. When availability is viewed as a secondary objective, minimal design margins are all too often the outcome. Such systems are prone to transient errors when subjected to just the right combination of system load, component manufacturing process, and environmental conditions.

Manufacturing issues: Process control and monitoring issues within the manufacturing process can result in marginal components that lead to transient errors. In their quest to drive down production costs even further, vendors may opt for shorter test cycles or rely on extensive testing of samples, practices that also result in the release of components that cause transient errors.

Over time, these factors can cause affected components to move from a fully functioning state to an intermittent state and, finally, to a hard failed state. Depending on the defect, the component may be in an intermittent state for a relatively long period of time during which transient errors may occur more frequently.

¹ "Firmware-based Platform Reliability," Intel Corporation, 2004.
² P. Shivakumar, M. Kistler. "Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic."

The industry's highest measured uptime

Stratus Technologies' uncompromising commitment to uptime is visible every day. We are the first and only server vendor to report the dependability of our installed base of systems worldwide. The Stratus Uptime Meter℠ is refreshed daily from actual field data and displayed on our Web site. The results report that Stratus systems surpass five nines of uptime.*

* Hardware- and software-related incidents, including the Microsoft Windows operating system, are part of the measurement.

Addressing the Growing Threat of Transient Errors

While no computer system can prevent a transient error from occurring, Stratus' line of fault-tolerant systems has been uniquely engineered to detect, isolate, and withstand transient hardware errors.

Engineered for Error Handling

Stratus servers are designed using replicated core system components, including motherboards, processors, memory, I/O buses, and I/O adapters. At the heart of the system design is the use of multiple CPU-memory units operating in synchronized operation. Lockstep processing ensures that any errors, including transient errors, are detected and that the system can survive any CPU-memory unit error without interrupting processing and without loss of data or state information.

In addition to riding through the error condition, Stratus systems capture and log information about the transient occurrence and will automatically take the affected component out of service if it reaches a threshold beyond which it has been determined that a failure is likely to occur. In the event a component is taken out of service, its partner component simply continues to operate as normal.

Extended Software Protection

The Stratus approach to availability extends to the system software as well.
Because software is particularly vulnerable to hardware errors, proper error handling can avert many potential software problems. With conventional servers, many problems attributed to software are actually caused by transient hardware errors. Stratus failsafe software reliably distinguishes software issues from hardware issues, greatly contributing to effective and timely problem resolution.

Stratus fault-tolerant hardware and driver hardening technology shield the operating system, middleware, and application software from the impact of transient errors. As part of the driver hardening process, Stratus conducts extensive transient, as well as hardware, error injection testing. This means potential problems are identified and resolved before the system is ever installed at a customer site.

The Continuous Processing Advantage

Stratus' approach to availability is based on a design philosophy that detects, isolates, and corrects errors before they cause system downtime or corruption of valuable business data. Preventing downtime is a key design point that differentiates Stratus servers from conventional servers and high-availability clusters. The result is uninterrupted uptime that has been proven to meet or exceed 99.999%. Your applications benefit from the design innovations of Stratus servers from the time you load them on the system; no software modification or special configuration is necessary.

Find out how valuable and simple it is to pair your application solution with fault-tolerant systems from Stratus Technologies.

www.stratus.com

© 2005 Stratus Technologies Bermuda Ltd.
All rights reserved. X863

Specifications and descriptions are summary in nature and subject to change without notice.

Stratus and Continuous Processing are registered trademarks, the Stratus Technologies logo is a trademark, and Uptime Meter is a service mark of Stratus Technologies Bermuda Ltd. Intel is a registered trademark of the Intel Corporation in the United States and other countries.
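
Illustrative sketch: lockstep detection and threshold-based removal

The behavior described under Engineered for Error Handling (replicated units running the same work, errors detected by comparison, each occurrence logged, and a component taken out of service once it crosses a threshold) can be pictured with a small software model. This is only a sketch and not the Stratus design, which is implemented in hardware and is not detailed in this paper. The use of three replicas, majority voting, the names Replica and LockstepGroup, and the threshold value are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Replica:
    """Hypothetical stand-in for one CPU-memory unit."""
    name: str
    compute: Callable[[int], int]
    error_count: int = 0
    in_service: bool = True


@dataclass
class LockstepGroup:
    """Runs every step on all in-service replicas and compares the results."""
    replicas: List[Replica]
    error_threshold: int = 2  # assumed value, chosen only for the demo
    log: List[Tuple[str, int, int, int]] = field(default_factory=list)

    def step(self, value: int) -> int:
        active = [r for r in self.replicas if r.in_service]
        results = {r.name: r.compute(value) for r in active}

        # Majority vote (an assumption): the most common result wins.
        winner, votes = Counter(results.values()).most_common(1)[0]
        if votes * 2 <= len(active):
            raise RuntimeError("no majority: cannot tell which replica is correct")

        for r in active:
            if results[r.name] != winner:
                # Log the occurrence, then retire the replica at the threshold.
                r.error_count += 1
                self.log.append((r.name, value, results[r.name], winner))
                if r.error_count >= self.error_threshold:
                    r.in_service = False
        return winner


def healthy(x: int) -> int:
    return x * x


def flaky(x: int) -> int:
    # Simulated intermittent fault: flips the low bit on even inputs.
    return (x * x) ^ 1 if x % 2 == 0 else x * x


group = LockstepGroup([Replica("a", healthy), Replica("b", flaky), Replica("c", healthy)])
outputs = [group.step(x) for x in range(1, 7)]
print(outputs)                                  # every step returns the correct square
print([(r.name, r.in_service) for r in group.replicas])
```

In this model the caller never sees a wrong result: the two disagreements from replica "b" are outvoted and logged, and after the second one "b" is removed while "a" and "c" continue. With only two replicas a disagreement can be detected but not attributed by voting alone, which is why the sketch raises an error when no majority exists.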