To improve uptime, datacentre owners and operators have traditionally focused on the physical infrastructure that supports IT, incorporating independent redundancies, monitoring systems, failover schemes and more.
On the whole, the strategy has worked, yet research by Uptime Institute (an independent division of The 451 Group) and others shows that large-scale outages continue to plague the datacentre industry - and that some operators continue to do much better than others. The level of downtime that still occurs might be surprising to some, given the significant economic consequences of service disruption and widespread use of standards, techniques and technologies dedicated to maintaining continuous availability.
A major reason for some of the continuing problems, according to Uptime, is that design alone cannot guarantee datacentre efficiency or availability. Operations management (e.g., capacity management, change management, incident management), maintenance strategies, staff and contractor training, and emergency-response procedures all affect availability. There are signs that this message is beginning to sink in: Uptime is reporting that a growing number of datacentre owners and multi-tenant datacentre (MTDC) clients are requiring third-party validation of operational best practices to ensure optimal facility performance. (An analogy might be that airline owners and passengers don't just want to know that an aircraft is certified as functional when it leaves the factory - they want to be assured that the crews know how to maintain and fly it safely.)
In its latest Start with the End in Mind initiative, Uptime goes even further. In new datacentres, it says, operations holds the key to efficiency (including availability). Operations is the ultimate client in any datacentre expansion, Uptime asserts, and as such should be integral to the project from conception. By focusing on the way the datacentre will be run from the earliest planning stage, owners increase the efficiency, uptime and ROI of their facilities while reducing cost and risk.
Most datacentre outages are caused by human error
Partly as a result of the success of the Tier-classification system and the general adoption of redundancy in datacentre designs, datacentre outages caused by component failures are now rare, and attempts to increase uptime solely through improvements to the physical infrastructure are reaching the point of diminishing returns. Fewer than one-third of the unplanned outages reported in a recent survey conducted by the Ponemon Institute for Emerson Network Power were attributed to equipment failures, and respondents reported that most of those outages were avoidable: almost all were attributed either to human error or to an equipment failure that might have been prevented had adequate training, monitoring or maintenance procedures been in place.
These results are consistent with the data gathered by Uptime's Networks, which have tracked incidents and outages in member facilities for over 25 years, compiling a detailed dataset of more than 5,000 incidents across over 400 datacentres. Although outages in Networks member facilities are exceedingly rare (approximately one per decade), virtually all can be traced back to human error.
These findings reinforce the importance of a comprehensive management and operations program in ensuring datacentre availability and maximising efficiency. Organisations that closely align datacentre operations with business objectives, and that use industry best practices as the benchmark for continuous monitoring and improvement, optimise datacentre performance and realise the best possible return on their investment.
The new focus on training and operations from Uptime opens up further possibilities: while availability is generally good, almost all research suggests that energy efficiency and capacity utilisation are not. Over time, the focus on training and ongoing operations may offer a new channel for disseminating best practices in capacity management and energy efficiency.
Will independent verification of operational best practices become a requirement?
As the physical infrastructure of the datacentre becomes increasingly commodified, operational performance rises in prominence. For multi-tenant datacentre (MTDC) operators and other IT service providers that need to meet client-imposed uptime requirements, an objective performance assessment can be a key differentiator.
Most organisations have internal training and review procedures in place, and there are some standards developed for other industries or other purposes (e.g., ISO, ITIL, SSAE 16, SAS 70, EN 50600) that have been adapted to also address datacentre facility availability. Historically, however, third-party validation of operational best practices based on a datacentre-specific system has not been generally available.
That has now changed, and facility owners are taking note. Uptime Institute, in consultation with industry stakeholders, has developed an operations standard delivered through two operations-assessment protocols designed specifically for datacentres and created by datacentre owners: Tier Standard: Operational Sustainability, for Tier-certified facilities, and the M&O (Management and Operations) Stamp of Approval, for datacentres that are not Tier certified.
Both these methodologies address the site management behaviours and decisions that impact long-term datacentre performance, such as staffing and organisation (staffing levels, qualifications and skill mix); training and professional development; preventative maintenance programs and processes; operating conditions and housekeeping; planning management; coordination practices and resources; and more.
Are we entering a new stage in datacentres, where operations are certified? Certainly, design and build certification has become increasingly important in recent years. It is now common for owners to include design or constructed facility certification requirements in datacentre construction requests for proposals (RFPs) - and for potential tenants to ask for certification from MTDC operators.
Now that credible operational certification is available, an increasing number of owners and tenants are including requirements for operational certifications in their facility management RFPs; some even carry significant penalties if the contractor fails to meet or sustain minimum standards. For example, the Province of Ontario recently included a requirement for operational certification in an RFP with a $1m penalty should their IT service provider fail to comply. (A detailed case study is available on the Johnson Controls Global WorkPlace Solutions website.)
Operations holds the key to reliability
Operational excellence is not just about availability but also about efficiency. Uptime Institute research and field experience indicate that, even in new builds, operations holds the key to efficiency.
The design-build phase is typically less than 5% of the datacentre's lifespan, yet the team responsible for 95% of the facility's life - the operations team - is often not involved until the facility is commissioned. This is a mistake, Uptime states: organisations that view datacentre expansion as a 'design build operate' process rather than a function of change management put the efficiency, uptime and ROI of their facilities at risk. Uptime reports that datacentres where operations staff were integral to the construction process from conception run more reliably and profitably from day one.
And according to Uptime, conception really does mean 'conception': in the most efficient and reliable datacentres, those who operate it are brought into the new build, retrofit or expansion process in the preconstruction/planning phase. This ensures that the team that will run the facility on a daily basis is involved in the decisions that will affect how efficiently it can be run.
This observation is the inspiration behind Uptime's Start with the End in Mind initiative. Led by Lee Kirby, CTO of Uptime Institute and former senior executive at Lee Technologies, Uptime's new program details how design/build and operations development should occur simultaneously. A typical datacentre build, retrofit or expansion process involves five phases: pre-construction, design, construction, commissioning and turnover.
Involving the operations team at each phase of the process will ensure not only that the facility is engineered to optimise maintainability, but also that the operations team can provide continuity for knowledge management and transition to production. Certifications, if desired, are incorporated as milestones, and review and optimisation of operational procedures continue as an iterative process throughout the facility's lifespan - ensuring that, as Uptime puts it, 'it doesn't end in tiers.'
The table below shows the activities that should occur concurrently to ensure the facility is running optimally on day one.
The 451 take
Most datacentre outages are caused by human error. This can never, of course, be eliminated, but the risks can be reduced by systematically and consistently following a program of operational best practices. Datacentre operators that want to improve their facility's reliability may find the operations-assessment protocols offered by Uptime a helpful resource, and their Start with the End in Mind initiative reinforces the role of the operations team in optimising efficiency.
Although the argument for obtaining third-party verification of operational performance is less obvious for the typical enterprise owner/operator, certification could be a key differentiator for an IT service provider. Uptime reports that an increasing number of clients are including operational-certification requirements in their RFPs. This has implications for facility managers and MTDC providers alike.
Datacentre uptime will be one of the topics discussed at 451 Research's Hosting and Cloud Transformation Summit, 1-2 April, London.