U.S. Army Research Laboratory DoD Supercomputing Resource Center

Cover Story

ARL MSRC increases overall computing power from 9.1 to 36 TFLOPS

By Mike McCraney, Systems Integration Lead

Contracts and plans have now been finalized for the Technology Insertion 2004 (TI-04) upgrades for the ARL MSRC. The High Performance Computing Modernization Program Office (HPCMPO), in coordination with GSA, has inked contracts with three high performance computing (HPC) vendors that will, in one giant leap, vault the ARL MSRC from 9 trillion floating-point operations per second (TFLOPS) to more than 36 TFLOPS.

Not surprisingly, the facilities, systems engineering, and expansion analysis teams at the ARL MSRC are preparing the center for what could be considered an onslaught of HPC power scheduled to arrive this summer.

[Pictures: SGI Altix, Linux NetworX Evolocity II, IBM Opteron]

“This increase in computing capability will give DoD scientists and engineers the ability to solve complex, three-dimensional, time-dependent, physics-based problems in a timeframe that can provide the data necessary to assist with weapon development and procurement decisions,” said Charles J. Nietubicz, Acting Deputy Director of the Computational and Information Sciences Directorate (CISD).

“The ARL MSRC serves a diverse, technically challenging HPC user population,” said Denice P. Brown, Acting Center Director of the ARL MSRC. “The selection of Linux NetworX, IBM, and SGI systems provides the flexibility to meet the users’ diverse challenges.”

As always, the key focus of the ARL MSRC is to provide service and support to the warfighter. These systems will provide the massive compute power required by DoD scientists and researchers to do just that.

HPC technology

The ARL MSRC has consistently been a leader in new technology trends in HPC within the HPCMPO program, and these upgrades will carry forward this tradition. In recent years, the HPC industry has seen a significant shift away from the traditional proprietary HPC technology to commodity-based systems. The ARL MSRC was on the leading edge of these trends, installing small test systems by Compaq and IBM in the fall of 2001. Based on encouraging, yet limited benchmark runs from these two small machines, the ARL Expansion and Analysis team could clearly see the huge advantage in price and performance of these commodity systems in the predominantly MPI-based code base at the center.

That point was further driven home during the TI-03 procurement phase when the ARL MSRC installed a 256-processor Intel-based Linux cluster from Linux NetworX. The system has proven to be a steady workhorse for mid-sized MPI jobs and has maintained a solid reliability stance since its installation in the fall of 2003.

The systems

Following the earlier success of the 256-processor system Powell, Linux NetworX has been called upon to deliver another system to the ARL MSRC environment. This new system will comprise 1,024 dual-processor Intel Xeon nodes, 3,072 GB of RAM, and the latest Myrinet interconnect. The system will be supported by a full Gigabit Ethernet backbone and 50 TB of available RAID storage. At a 3.6 GHz clock speed, the addition of 2,048 Xeon processors will boost the unclassified environment by nearly 14 TFLOPS.

Installed alongside the 1,024-processor IBM Power3 system, the 256-processor Linux NetworX system will help support the demand for scalable parallel systems well into the future. With the continued support of the SGI Origin 3800 and IBM Power4 system with 128 GB memory, the center will continue to support large shared memory requirements as well.

On the classified side, IBM has been called upon to deliver a 2,304-processor, AMD Opteron-based Linux system. The 2.2 GHz Opteron processors will boost the classified compute power by more than 10 TFLOPS. With a full 1.5 GB RAM per processor, this system will also bolster the overall available classified memory capacity.
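The TFLOPS figures quoted for the two new Linux systems can be roughly checked from processor count and clock speed. The sketch below assumes two floating-point operations per processor per clock cycle; that factor and the function name are illustrative assumptions, not figures from the ARL MSRC or the vendors. Under that assumption the estimates come out near the quoted values, though not identical to them.

```python
# Rough theoretical-peak estimate: processors x clock (GHz) x flops per cycle.
# FLOPS_PER_CYCLE = 2 is an assumption made for this sketch, not a figure
# taken from the article.
FLOPS_PER_CYCLE = 2


def peak_tflops(processors, clock_ghz, flops_per_cycle=FLOPS_PER_CYCLE):
    """Return an estimated theoretical peak in TFLOPS."""
    gflops = processors * clock_ghz * flops_per_cycle
    return gflops / 1000.0


# Processor counts, clock speeds, and memory sizes are the ones quoted above.
xeon_peak = peak_tflops(2048, 3.6)       # unclassified Linux NetworX system
opteron_peak = peak_tflops(2304, 2.2)    # classified IBM Opteron system
xeon_gb_per_processor = 3072 / 2048      # 3,072 GB spread over 2,048 processors
opteron_memory_gb = 2304 * 1.5           # 1.5 GB RAM per processor

print(f"Xeon system estimate:    {xeon_peak:.1f} TFLOPS")
print(f"Opteron system estimate: {opteron_peak:.1f} TFLOPS")
print(f"Xeon memory per processor: {xeon_gb_per_processor:.1f} GB")
print(f"Opteron total memory:      {opteron_memory_gb:.0f} GB")
```

Measured performance on real codes will fall below any theoretical peak of this kind, so the numbers are useful only for comparing system sizes.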

Finally, the ARL MSRC returned to its roots by receiving a 256-processor Altix system from Silicon Graphics (SGI). The Altix system relies on the proven NUMALink interconnect to provide not only inter-processor communications but also non-uniform global access to all system memory. Like the SGI Origin systems that preceded it, the Altix system will be able to support both scalable MPI and large shared memory jobs. A smaller test and development system will be installed as a precursor to the full system, providing a platform for code porting and optimization. Later, it will serve as a development system on which the system administration team can test and debug new versions of the operating system and/or recommended patches and optimization flags.

Providing balance

Although the ARL MSRC takes great pride in providing some of the largest compute systems in the world, the management of a center like the ARL MSRC can truly be a balancing act. Certainly sheer compute power is important; power to the users and to the desktop is probably the prime resource the center offers. But sheer power cannot solve all the problems of the complex user base at the ARL MSRC. New commodity-based systems can certainly support large-scale, extremely parallel jobs that employ MPI for communications.

Capability jobs beyond a thousand processors will more than likely become commonplace in the coming years with the sizes of systems being installed.

However, the ARL MSRC traditionally supports several key users with several key codes that are extremely memory intensive. And in some cases, these codes are not particularly scalable. Enter the balance variable. During the TI-03 upgrade cycle, the ARL MSRC chose wisely to increase the memory size and overall communications capability of one of its newer systems: Shelton.

With a full 128 GB local memory available on a single 32-processor node, large memory jobs not only maintain a home at the ARL MSRC but are supported by one of the premier HPC processors in the 1.7 GHz Power4+. Maintaining this balance in sheer processor speed and shared memory capability is key in the overall center management. Although the SGI Origin 2000 systems have been phased out this year to make room for the new commodity-based systems, maintaining the Origin 3800 systems and upgrades to the Power4+ system will provide the continued support necessary for large memory, minimal scalability jobs.

Balance is also key in support systems and resources. With the massive influx of compute power, networking, storage, and connectivity will be key to providing the consistent, balanced support ARL MSRC users have come to expect. TI-04 also brings revitalization to the overall network infrastructure and mass storage systems. Plans are being finalized for an overall backbone upgrade to the new 10 Gigabit Ethernet technology. Gigabit Ethernet has been the central backbone since 2001, but with recent productization of the new 10 Gigabit technology, all central routers and switches will be upgraded to support the greater speeds available. With these upgrades also comes a consolidation and simplification of overall networking resources. Consistency in all environments is always a key factor in upgrade decision making; thus we will be improving the network infrastructure in all compute environments as well as the external interfaces to DREN and SDREN.

Following large, scalable compute systems, naturally, is the need for a high-capacity, high-bandwidth data management infrastructure. The aging Sun E6500 file servers and tape drive infrastructure were upgraded as part of the TI-03 cycle. These upgrades were in advance and in anticipation of the huge increase in overall compute power expected next summer. The dual E6500 systems were replaced this fall by dual Sun E15000 systems, with twice the compute power and memory. These systems, feeding into the higher density, higher tape speed drives installed from StorageTek last fall, will provide the necessary balance in overall I/O capability required at the center.

Facility modifications

In order to accommodate the sheer size of these new systems, the facilities at the ARL MSRC are going through significant changes and expansion. With systems on the order of 2,000 processors, heat dissipation becomes a serious problem. As part of the facilities upgrades to support these new systems, new 22-ton air conditioning systems will be installed in the compute environments. Additional power panels and power distribution units are also needed to handle the additional electrical loads.
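To put the 22-ton figure in terms of electrical load, one ton of refrigeration is 12,000 BTU per hour, or roughly 3.517 kW of heat removal. The sketch below converts a unit's rating to kilowatts and estimates how many units a given heat load would require. The 300 kW load is a hypothetical number chosen only for illustration; the article does not state the actual heat load or the number of units to be installed.

```python
import math

# One ton of refrigeration = 12,000 BTU/h, approximately 3.517 kW.
KW_PER_TON = 3.517
UNIT_TONS = 22  # rating of the new air conditioning units, from the article

unit_capacity_kw = UNIT_TONS * KW_PER_TON


def units_needed(heat_load_kw, capacity_kw=unit_capacity_kw):
    """Return the minimum number of units needed to remove heat_load_kw."""
    return math.ceil(heat_load_kw / capacity_kw)


# HYPOTHETICAL heat load, for illustration only (not from the article).
example_load_kw = 300.0

print(f"One {UNIT_TONS}-ton unit removes about {unit_capacity_kw:.1f} kW")
print(f"Units for a {example_load_kw:.0f} kW load: {units_needed(example_load_kw)}")
```

A real facilities plan would also add redundancy and account for airflow, so a count like this is only a lower bound.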

Finally, in the unclassified environment, the facilities planning team has formulated a plan to expand the computer room into adjacent spaces, gaining approximately 2,200 square feet of usable space for compute equipment.

On the classified side, airflow has become an issue with the warmer IBM Power4 systems. To improve air recycling through the air handler units, the drop ceiling will be raised to the maximum height allowed by the physical ceiling in the room. With the ceiling raised, expelled hot air will not only escape the compute racks more quickly but will also be returned to the top-fed return air plenums on the air handlers along the exterior walls. These facilities modifications will support the TI-04 equipment.

The Department of the Army is considering plans for a single consolidated complex to house all ARL MSRC assets and staff. Early indications suggest that groundbreaking would begin sometime after 2007.

(Ed.'s note: Mike McCraney left the ARL MSRC in March to become the Director of Operations at the Maui High Performance Computing Center.)