
A unique US government-funded computing effort is making it easier for corporations to access the largest-scale computers on the planet. Dubbed TeraGrid, the effort spans nine academic and government institutions and has reached critical mass this year.

The notion is to combine the largest supercomputers into a global processing and storage grid to tackle the thorniest computing problems. "We want to make available high-end resources to the broadest community," says Dane Skow, director of the Grid Infrastructure Group, who coordinates TeraGrid operations from the University of Chicago's Argonne National Laboratory. "We want to leverage our top-of-the-line equipment for people who don't have the skills to do it themselves."

TeraGrid began with grants in 2000 to the Pittsburgh Supercomputing Centre. It has grown by adding other supercomputing centres around the country, and its second user conference was held at the University of Wisconsin earlier this month.

Part of the TeraGrid is a simple user interface for the world's largest distributed computing environment, the ultimate graphical user interface (GUI) on steroids. "The point of TeraGrid is to pull together the capabilities and intellectual resources for problems that can't be handled at a single site," says Rob Pennington, deputy director of the National Centre for Supercomputing Applications (NCSA). "We make it easier for researchers to use these multiple computing sites with a very small increment in training and technical help."

Big numbers

The numbers are staggering, even for IT managers who are used to big projects. The TeraGrid network currently spans more than 20 petabytes of storage - that's enough to hold a billion encyclopedias - and more than 280 teraflops of compute power.
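To put the storage figure in context, here is a rough back-of-envelope check of the billion-encyclopedia comparison - the 20 MB-per-copy figure is an assumption for the text of a large encyclopedia, not a number from TeraGrid:

    # Rough arithmetic behind the "billion encyclopedias" comparison.
    petabyte = 10**15                      # bytes
    total_storage = 20 * petabyte          # TeraGrid's quoted capacity
    encyclopedias = 10**9                  # one billion copies
    per_copy_mb = total_storage / encyclopedias / 10**6
    print(per_copy_mb, "MB available per encyclopedia")   # prints 20.0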

While there are big numbers involved in describing TeraGrid, "we want to be more than just a source of computing cycles," says Pennington. What TeraGrid is trying to accomplish is a common means of accessing processing power and storage at the largest scale, freeing researchers from writing custom code for each system. "We are trying to make it better on the front end," says Skow.

"This isn't just about providing some time on a big machine but being able to solve all the plumbing problems so that we can have a uniform end-to-end and integrated experience for all kinds of research," says Bill Bell, the deputy director of the NCSA.

And the TeraGrid is only going to get bigger, thanks to a combination of various government, military and private sources. "We are at the beginning of a very aggressive growth curve thanks to the National Science Foundation," says Skow. "The size of the problems that a researcher can attack is going to be enormous and on a scale that was never possible before. We are going to see basic science being able to leapfrog and go far beyond what people have been able to do previously."

Practical supercomputing

This isn't just a bunch of academics figuring out the next billion decimal places of pi or drawing pretty fractal pictures faster. While many of the projects certainly advance basic science research - studies of black holes, climate modelling and data visualisation - there is a big component of commercial and industrial research, too. This means that TeraGrid can have direct benefits in a number of different commercial markets and disciplines, including understanding how ketchup works. (Maybe this science will once and for all settle whether it is a vegetable or not.)

One TeraGrid project was an effort to vastly increase the number of known zeolites, special chemical catalysts used in a wide variety of industrial processes, from making laundry detergents to refining petroleum products. Until recently, the total number of known natural zeolites stood at about 50. Then Michael Deem, a chemical engineering professor at Rice University, developed an application that ran on several TeraGrid supercomputers and came up with millions of different chemical structures that could serve as future catalysts.

Another example of a commercial application of supercomputing technology was a project done for Procter & Gamble (P&G) to improve the Pringles potato chip production line. The high speed of the manufacturing line was creating air drafts that blew chips out of the cans during the packing process. P&G was able to cut down on wasted chips by using computational fluid dynamics models developed by aircraft maker Boeing, says Melyssa Fratkin, corporate and government relations manager at the Texas Advanced Computing Centre (TACC) at the University of Texas. Call this another form of chip processing!

And while eBay and Google are not participating in TeraGrid, they use similar designs for their own data processing empires. "EBay runs its transactional system on something that looks very similar to the TeraGrid," says Paul Strong, a research scientist at eBay. "Our auction platform runs on a grid of more than 7,000 servers, and we compile our applications on a grid of several hundred servers. So we share some of the same problems in terms of managing very large and complex computing infrastructures." Strong says he expects TeraGrid to help users understand and address the kinds of challenges other companies will face as they need ever more computing horsepower.

There is plenty of other work going on across the TeraGrid, so much so that the NCSA has dedicated hardware for commercial users. "We have a machine called T3 that is dedicated for our industrial partners," says Pennington. "That is a 22-teraflop Dell blade server that is kept constantly busy and is the second-largest system in our data centre."

How to partner with the TeraGrid

Private-sector companies looking to take on large-scale computing projects can become corporate affiliates of TeraGrid's member academic and research institutions through initiatives such as the NCSA's Private Sector Program, perhaps the longest-standing effort of its kind. The NCSA is where the Web was first turned into a commercial medium, thanks to the Mosaic browser back in the early 1990s.

Affiliates typically pay an annual membership fee that is scaled to reflect the level of the project and the resources required. The TACC is just beginning its industrial partner program and charges between $5,000 and $25,000 a year for affiliate memberships.

Alternative funding is available for academic researchers through government sponsorship, and researchers can apply for access via the TeraGrid Web site. A program from the US Department of Energy (DOE) called Innovative and Novel Computational Impact on Theory and Experiment (INCITE) was recently expanded to cover commercial projects. Last year, DreamWorks, Corning, General Atomics, Pratt & Whitney, P&G and other companies got grants for their projects through INCITE. The DOE invites commercial applications for supercomputing projects at its various national labs. Applications are due 8 August, and the awards will be announced by the end of the year.

No matter how companies partner with the TeraGrid program, they can tap a lot of computing horsepower when they do, and the grid continues to grow as more machines are purchased and as computer scientists figure out better ways to harness widely distributed processing. The most extreme example of harnessing computing power (though not part of TeraGrid) was an experiment three years ago at the University of San Francisco called FlashMob Computing, which tied together several hundred computers donated for a weekend by local volunteers.

Search for aliens

And there have been projects like SETI@home that scavenge spare cycles on computers around the world to process data cooperatively. These initiatives have spurred recent developments in supercomputers, which are, after all, just large collections of Intel-class processors closely tied together.

"Clearly, cycle scavenging is attractive for certain types of organisations and workloads," says eBay's Strong. "In enterprises, you can't manage the quality of service and you give up some of the control once these machines are outside of the data centre. But some of the same technologies such as workload-scheduling are similar and used for today's supercomputing applications."

An example is what is happening at Purdue University. There, more than 6,000 computers are tied together across its several campuses as part of its TeraGrid contribution, a pool managed with the Condor workload-management software. The project uses idle time on ordinary student desktops, teamed up with the specialised high-performance multiprocessor machines in its data centre.
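For a flavour of how work reaches those scavenged desktops, here is a hypothetical Condor submit description - the small text file a researcher hands to Condor to queue a job. The program and file names are made up for illustration:

    # Hypothetical submit description for a single Condor job.
    universe     = vanilla
    executable   = analyse_data
    arguments    = sample01.dat
    output       = sample01.out
    error        = sample01.err
    log          = sample01.log
    # Only run on idle Linux machines in the pool.
    requirements = (OpSys == "LINUX")
    queue

Condor matches the job's requirements against machines advertising spare capacity and moves the work elsewhere if an owner reclaims a desktop.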

In the fall, the TACC will add a new Sun Microsystems machine that will run at more than 500 teraflops. The NCSA will add an 89-teraflop dual-boot Linux/Windows machine from Dell that is being called Abe (after Lincoln). Other TeraGrid locations are similarly boosting their processing power and storage capacity. Just to put this in perspective, the largest machine on the Top 500 list of supercomputers, IBM's Blue Gene/L at Lawrence Livermore National Laboratory, runs at 280 teraflops, while No. 10 on the list, a Cray XT3 at Oak Ridge National Laboratory, runs at 43 teraflops. Clearly, the TeraGrid will have a few entries in the Top 500 when the list is updated this fall.