The National Security Agency's new data center in Utah was built for a 65-megawatt load, making it one of the world's largest data centers. But it has had a rough start.
The $1.53 billion data center was built on a "very aggressive" schedule, according to the government, with groundbreaking in early 2011 and completion scheduled for the end of this year. But a report of "meltdowns," "flashes of lightning" and equipment damaged by arc faults has raised questions about that timetable.
There are many potential points of failure in this roughly 1-million-square-foot facility in Bluffdale, Utah, 23 miles south of Salt Lake City. The NSA says 100,000 square feet of the center is mission-critical raised-floor space; the rest of the mall-sized complex houses support and administrative functions.
The amount of wiring and compute equipment used is hard to imagine. Details found in government records speak to its complexity. For instance, among the systems installed are 60 diesel emergency standby generators, each capable of producing 3,000 kW, according to a Utah environmental quality report.
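A quick back-of-the-envelope check shows what those generator figures add up to. This is a minimal sketch that assumes each unit's 3,000 kW rating is its nameplate output; it ignores derating and however the units are grouped for redundancy, which the report doesn't describe, so the ratio is not a measure of the facility's actual backup margin.

```python
# Figures from the article: 60 standby generators rated at 3,000 kW each,
# serving a facility designed for a 65 MW load.
GENERATOR_COUNT = 60
GENERATOR_RATING_KW = 3_000
DESIGN_LOAD_MW = 65

def total_standby_mw(count: int, rating_kw: float) -> float:
    """Aggregate nameplate capacity in megawatts (1 MW = 1,000 kW)."""
    return count * rating_kw / 1_000

capacity_mw = total_standby_mw(GENERATOR_COUNT, GENERATOR_RATING_KW)
ratio = capacity_mw / DESIGN_LOAD_MW

print(f"Standby capacity: {capacity_mw:.0f} MW")        # 180 MW
print(f"Capacity / design load: {ratio:.2f}x")          # 2.77x
```

On paper, the standby fleet's combined rating is nearly three times the design load.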
The Wall Street Journal first reported the electrical problems, based on documents it obtained. The Salt Lake Tribune subsequently obtained a less alarming NSA document sent to a congressional oversight committee. It said none of the compute equipment was damaged and that the electrical problems were localized to breaker boxes. An NSA spokesman confirmed the accuracy of the Tribune's report about the letter, but the agency isn't releasing additional information.
So what went wrong?
"No one is immune from a lack of discipline," said Hudson Denney, a founder and principal of Net3 Technology, a provider of cloud-based managed services and data center platforms. "Electricity could care less if it's a billion-dollar government facility, a manufacturing plant, or a regional data center."
While Denney can't speak to the NSA's problems directly, he is not surprised by them.
"We can't begin to recall all the times we have seen a generator turn on without a transfer switch throwing, which ends up with a down data center and a lot of wasted diesel," Denney said. "We have had engineers work on server issues for hours just to find out a $2 cord went bad or the jack wasn't punched down correctly by a contractor.
"Sometimes people just get it wrong," Denney continued. "The redundant power isn't [there], the screws on the breaker panel are not tight, and the fuel in the generator wasn't cycled properly."
Similarly, Andy Pace, the chief operating officer of SingleHop, a dedicated server and cloud hosting company, said the sheer scale of the NSA data center "guarantees a complex power system.
"A great deal of the hardware in the facility is custom-built and that will also add a layer of complexity," Pace said. Problems can span from the human side, including poor quality assurance and oversight, to issues such as lack of redundancy, hardware faults and bad equipment. "What is most important with a problem like this is not the fact that it happened, but how quickly it can be solved," he said.
An arc fault happens when there is a broken pathway between two conductive objects, Pace said. In most cases, the air becomes the conductor bridging the gap between these two objects, which causes an arc. Eventually, it can begin to burn and melt anything, "and create quite the explosion."
Pace said a possible scenario for an arc fault is when there are two wires that aren't designed to handle the electric current that is running through them and the protective covering melts. An arc fault can then be created between the two exposed wires, he said.
Although the NSA is saying nothing about the amount of compute power in this facility, comparisons are possible.
The largest computing system in the world, China's Tianhe-2, developed by the National University of Defense Technology, has about 3.12 million compute cores and uses nearly 18 MW of power, or about 28% of the Utah data center's load.
China has plans to build data centers that can support 100 MW of power, according to Steve Conway, an IDC high-performance computing analyst. Most large supercomputing centers today operate with 10 MW to 25 MW loads.
In the U.S., the largest publicly known computer is Titan, a Cray system at Oak Ridge National Laboratory in Tennessee. It has 18,688 AMD 16-core Opteron chips and 18,688 Nvidia Kepler K20 GPUs, and the system draws 8.2 MW.
Microsoft's 700,000-square-foot Chicago data center, which opened in 2009, was built for 30 MW with the possibility of adding on another 30 MW.
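To put those numbers side by side, the short sketch below expresses each quoted figure as a share of the Utah facility's 65 MW design load. It assumes Tianhe-2's "nearly 18 MW" can be rounded to 18 MW, and it compares individual systems' power draw with a whole facility's design load, so the percentages are rough context rather than like-for-like comparisons.

```python
# Power figures quoted in the article, in megawatts.
NSA_UTAH_LOAD_MW = 65.0

loads_mw = {
    "Tianhe-2 (China)": 18.0,                   # "nearly 18 MW", rounded up
    "Titan (Oak Ridge)": 8.2,
    "Microsoft Chicago (initial build)": 30.0,
    "Microsoft Chicago (with expansion)": 60.0, # 30 MW plus a possible 30 MW
}

def share_of(load_mw: float, reference_mw: float = NSA_UTAH_LOAD_MW) -> float:
    """Return load_mw as a percentage of reference_mw."""
    return 100.0 * load_mw / reference_mw

for name, mw in loads_mw.items():
    print(f"{name:38s} {mw:5.1f} MW  {share_of(mw):5.1f}% of Utah load")
```

The Tianhe-2 line comes out at about 27.7%, consistent with the "about 28%" cited above.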
The government doesn't discuss IT spending on intelligence, nor does it disclose how much data center space these agencies have. But few data centers need as much power as the one the NSA is building in Utah. The agency hasn't said whether it plans to expand the Utah complex, but almost all data centers of this type are built with future expansion in mind.
In the grand scheme, the size of a single data center says little about an organization's total compute power. Large companies, including Microsoft, Google and Amazon, as well as government agencies, operate multiple data centers and face ever-increasing demands for space and power.
This article, Snap, crackle, pop: Electrical troubles shed light on NSA's data center, was originally published at Computerworld.com.
Patrick Thibodeau covers cloud computing and enterprise applications, outsourcing, government IT policies, data centers and IT workforce issues for Computerworld.