ComputerworldUK speaks to one of the labs providing data crunching power for the ongoing physics experiment responsible for discovering the Higgs boson particle - credited with unlocking some of the secrets of the universe.
The newly upgraded Large Hadron Collider is being rebooted after a circuit malfunction stalled its go-live, and the data-centre providers that support the data crunching are preparing their infrastructure for a hammering once CERN has eliminated the fault.
One such data-centre provider, Canada's national laboratory TRIUMF, is gearing up to process and distribute the masses of data that come from the detector, 24 hours a day, following an estate refresh.
TRIUMF works with ATLAS, a particle physics experiment at the Large Hadron Collider. Its detector, a “cathedral-sized” construction, monitors high-energy particle collisions.
It is the debris of these collisions that TRIUMF processes, along with nine other Tier 1 data centre providers across the world.
“We filter millions of collisions to produce 1,000 events per second that are stored for analysis... In order for the raw data to be useful for physicists, it must be reconstructed and calibrated like pixels from an electronic image,” Reda Tafirout, research scientist at TRIUMF and director of operations for the data centre, says.
A world-wide grid of HPCs
Around 3,000 scientists across 177 facilities and 38 different countries are taking part in the Large Hadron Collider experiment, which depends entirely on a network of high-performance computing centres, linked by grid tools to act as one system.
Between them, they store over 165 PB of data.
Paging the team at 3am
To ensure it can continue to play a major role in the project, TRIUMF needs to offer “at least 99 percent availability”, Tafirout says. If one of the data centres in the ATLAS grid goes down, it is difficult for its peers to pick up the slack, because each has only a dedicated amount of capacity.
“There’s a lot at stake - we have to respond to problems within four hours. A Tier 1 centre is not something running on ‘best efforts’, we really run 24/7. If one data centre is out it is a disaster.” The team even have a paging system which will wake them at 3am if needs be, he adds.
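To put the 99 percent figure in context, the short sketch below converts an availability target into the outage time it allows. It is illustrative arithmetic only: the function name and the choice of a 365-day year are assumptions made here, not anything TRIUMF or ATLAS has published.

```python
# Illustrative arithmetic only: how much downtime a given availability target permits.
# The 365-day year and the function name are assumptions made for this sketch.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def downtime_budget_hours(availability: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of outage allowed in a period at the given availability (0.0 to 1.0)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_hours


if __name__ == "__main__":
    yearly = downtime_budget_hours(0.99)
    print(f"99% availability allows about {yearly:.1f} hours of downtime a year")
```

On those assumptions, a 99 percent target leaves fewer than four days of outage across a whole year, which helps explain why the team is paged in the middle of the night.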
Last summer, TRIUMF refreshed its estate with a converged architecture built on a DataDirect Networks (DDN) storage platform with embedded dCache, a virtual file system, which has reduced costs five-fold.
“We went from 21 4U servers requiring multiple racks to one DDN SFA 12K system with eight virtual nodes in half a rack; power went from 25 to less than 7 kilowatts in addition to major cost savings on cooling,” he says.
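The figures Tafirout quotes can be worked through in a few lines. The sketch below uses only the numbers in the quote; treating 7 kilowatts as the new draw (the quote says "less than 7") and assuming a constant load all year are simplifications made here, so the results are lower bounds rather than measured savings.

```python
# Back-of-the-envelope arithmetic using only the figures quoted above.
# Assumptions: 7 kW is taken as the new draw (the quote says "less than 7"),
# and the load is treated as constant all year, so results are lower bounds.

OLD_SERVERS = 21
RACK_UNITS_PER_SERVER = 4   # "4U" servers
OLD_POWER_KW = 25.0
NEW_POWER_KW = 7.0          # upper bound from the quote
HOURS_PER_YEAR = 365 * 24

old_rack_units = OLD_SERVERS * RACK_UNITS_PER_SERVER       # rack space of the old estate
power_saved_kw = OLD_POWER_KW - NEW_POWER_KW                # minimum power saved
min_reduction = power_saved_kw / OLD_POWER_KW               # minimum fractional reduction
min_energy_saved_kwh = power_saved_kw * HOURS_PER_YEAR      # minimum yearly energy saved

if __name__ == "__main__":
    print(f"Old estate occupied {old_rack_units}U of rack space")
    print(f"Power draw fell by at least {power_saved_kw:.0f} kW ({min_reduction:.0%})")
    print(f"At constant load, that is at least {min_energy_saved_kwh:,.0f} kWh a year")
```

The calculation covers the IT load alone; the cooling savings Tafirout mentions would come on top of it.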
dCache stores and retrieves massive amounts of data distributed among a large number of varied server nodes, presenting them as one virtual file system. Using it, TRIUMF can exchange research and results from the experiment automatically and invisibly to users, eliminating location, capacity and latency concerns. DDN’s In-Storage Processing feature enabled TRIUMF to embed dCache within the storage controller, consolidating both its storage processing and data distribution requirements in one system, with single-pane-of-glass management.
A cheaper alternative may have been to go down the open-source route. But Tafirout says: “If you have infinite space you could go with the cheapest solution, but you would end up with hundreds of servers. In the end it isn’t a good strategy, as some will fail and you will end up fixing problems most of the time.”
“With DDN, we have no problem accommodating close to 5,000 data-hungry cores hammering on our storage at the Tier-1 centre.”