SGI Altix ICE 8200 (Harold)
User Guide

1. Introduction

1.1. Document Scope and Assumptions

This document provides an overview and introduction to the use of the SGI Altix ICE 8200LX (Harold) located at the ARL DSRC, along with a description of the specific computing environment on Harold. The intent of this guide is to provide information that will enable the average user to perform computational tasks on the system. To receive the most benefit from the information provided here, you should be proficient in the following areas:

  • Use of the UNIX operating system
  • Use of an editor (e.g., vi or emacs)
  • Remote usage of computer systems via network or modem access
  • A selected programming language and its related tools and libraries

1.2. Policies to Review

Users are expected to be aware of the following policies for working on Harold.

1.2.1. Login Node Abuse Policy

Memory or CPU intensive programs running on the login nodes can significantly affect all users of the system. Therefore, only small applications requiring less than 10 minutes of runtime and less than 2 GBytes of memory are allowed on the login nodes. Any job running on the login nodes that exceeds these limits may be unilaterally terminated.

1.2.2. Workspace Purge Policy

The /usr/var/tmp directory is subject to a fifteen-day purge policy. A system "scrubber" monitors scratch space utilization, and if available space becomes low, files not accessed within fifteen days are subject to removal, although files may remain longer if the space permits. There are no exceptions to this policy.
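
To see which of your files might fall under this policy, you can list files by last access time. The following is a minimal sketch for sh/ksh/bash, not a site-provided tool: the function name purge_candidates is hypothetical, and the scrubber's actual selection rules may differ from this approximation.

```shell
#!/bin/bash
# Hypothetical helper: list regular files under a directory that have not
# been accessed for at least fifteen days.
# find's "-atime +14" matches files whose last access lies 15 or more
# whole days in the past.
purge_candidates() {
    find "$1" -type f -atime +14
}

# Example: check your scratch directory, if it is set and exists.
if [ -n "${WORKDIR:-}" ] && [ -d "$WORKDIR" ]; then
    purge_candidates "$WORKDIR"
fi
```

Any file this prints that you still need should be copied to long-term storage rather than left in scratch space.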

1.3. Obtaining an Account

The process of getting an account on the HPC systems at any of the DSRCs begins with getting an account on the HPCMP Portal to the Information Environment, commonly called a "pIE User Account." If you do not yet have a pIE User Account, please visit HPC Centers: Obtaining An Account and follow the instructions there. Once you have an active pIE User Account, visit the ARL DSRC accounts page for instructions on how to request accounts on the ARL DSRC HPC systems. If you need assistance with any part of this process, please contact CCAC.

1.4. Requesting Assistance

The Consolidated Customer Assistance Center (CCAC) is available to help users with unclassified problems, issues, or questions. Analysts are on duty 8:00 a.m. - 11:00 p.m. Eastern, Monday - Friday (excluding Federal holidays).

You can also contact the ARL DSRC directly for support services not provided by CCAC.

For more detailed contact information, please see our Contact Page.

2. System Configuration

2.1. System Summary

Harold is an SGI Altix ICE 8200. The login and compute nodes are populated with dual Intel Xeon Nehalem-EP quad-core processors. Harold uses the 4X DDR Infiniband as its high-speed network for MPI messages and IO traffic. Harold uses Lustre to manage its parallel file system that targets SGI's IS4600 (Infinite Storage) RAID arrays. Harold has 1,344 compute nodes that share memory only on the node; memory is not shared across the nodes. Each compute node has two quad-core processors (8 cores) with its own SLES 11 SP1 operating system, sharing 24 GBytes of DDR3 memory, with no user-accessible swap space. Harold is rated at 109.3 peak TFLOPS and has 495 TBytes (formatted) of disk storage.

Harold is intended to be used as a batch-scheduled HPC system. Its login nodes are not to be used for large computational (e.g., memory, IO, long executions) work. All executions that require large amounts of system resources must be sent to the compute nodes by batch job submission.

Node Configuration
                          Login Nodes                     Compute Nodes
Total Nodes               8                               1344
Operating System          SLES 11 SP1                     SLES 11 SP1
Cores/Node                8                               8
Core Type                 Intel Xeon quad-core Nehalem    Intel Xeon quad-core Nehalem
Core Speed                2.8 GHz                         2.8 GHz
Memory/Node               48 GBytes                       24 GBytes
Accessible Memory/Node    2 GBytes                        20 GBytes
Memory Model              Shared on node.                 Shared on node. Distributed across cluster.
Interconnect Type         4x DDR Infiniband               4x DDR Infiniband

File Systems on Harold
Path            Capacity        Type
/usr/var/tmp    495 TBytes *    Lustre
/usr/people     23 TBytes       Lustre
/usr/cta        495 TBytes *    Lustre
/archive        16 TBytes       NFS

* /usr/var/tmp and /usr/cta share the same locally mounted Lustre file system.

2.2. Processors

Harold uses 2.8-GHz Intel Xeon X5560 Nehalem-EP processors on its login and compute nodes. There are 2 processors per node, each with 4 cores, for a total of 8 cores per node. In addition, these processors have 4x256 KBytes of L2 cache, and 8 MBytes of L3 cache.

2.3. Memory

Harold uses both shared- and distributed-memory models. Memory is shared among all the cores on a node, but is not shared among the nodes across the cluster.

Each login node contains 48 GBytes of main memory. All memory and cores on the node are shared among all users who are logged in. Therefore, users should not use more than 2 GBytes of memory at any one time.

Each compute node contains 20 GBytes of user-accessible shared memory.

2.4. Operating System

The operating system on Harold is SUSE Linux Enterprise Server (SLES) 11 with a 2.6-based kernel. The operating system supports 64-bit software.

2.5. File Systems

Harold has the following file systems available for user storage:

2.5.1. /usr/people

This file system is locally mounted from Harold's Lustre file system. It has a formatted capacity of 23 TBytes. All users have a home directory located on this file system which can be referenced by the environment variable $HOME.

2.5.2. /usr/var/tmp and /usr/cta

These directories share Harold's locally mounted Lustre file system. It has a formatted capacity of 495 TBytes. All users have a work directory located on /usr/var/tmp which can be referenced by the environment variable $WORKDIR . All center-managed COTS packages are stored in /usr/cta. In addition, users may request space in this area under /usr/cta/unsupported to store user-managed software packages that they wish to make available to other owner-designated users. To have space allocated in /usr/cta/unsupported a user should submit a request to the ARL DSRC helpdesk, either through CCAC or by directly contacting the ARL DSRC helpdesk. Requests will be processed on a first-come, first-served basis.

2.5.3. /archive

This NFS mounted file system is accessible from the login nodes on Harold. Files in this file system are subject to migration to tape and access may be slower due to the overhead of retrieving files from tape. It has a formatted capacity of 16 TBytes with 6.6 PBytes of archival tape storage. The disk portion of the file system is automatically backed up. Users should migrate all large input and output files to this area for long-term storage. Users should also migrate all important smaller files from their home directory area in /usr/people to this area for long-term storage. All users have a directory located on this file system which can be referenced by the environment variable $ARCHIVE_HOME.

2.5.4. /tmp or /var/tmp

Never use /tmp or /var/tmp for temporary storage! These directories are not intended for temporary storage of user data, and abuse of these directories could adversely affect the entire system.

2.5.5. /p/cwfs

This path is directed to the Center-Wide File System (CWFS) which is meant for short-term storage (no longer than 30 days). All users have a directory defined in this file system. The environment variable for this is $CENTER. This is accessible from both the Utility Server login and compute nodes and the HPC systems login nodes. The CWFS has a formatted capacity of 800 TBytes and is managed by Panasas PanFS.

2.5.6. RAID/Striping Concerns for Large Files

Both $WORKDIR and /usr/people have a default stripe count of 1, and both are targeted to RAID 5 LUNs.

It is important to note that /usr/var/tmp is a parallel (multiple-striped) file system. This means that as files are written, they are automatically divided into chunks and written across multiple LUNs or "OSTs," simultaneously. This process, called "striping," plays a vital role in running large jobs, because it significantly improves file I/O rates of reads or writes. Without parallel striping, large jobs, many of which require hundreds of GBytes of disk space, would spend much of their time just reading from and writing to disk. Users planning to create files greater than 80 GBytes should increase the stripe count on that file or the directory that will hold that file before writing to that file. Go to $SAMPLES_HOME/OST_Stripes on Harold for information on how to increase stripe counts.
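
As a rough illustration of the advice above, the sketch below raises the stripe count on a directory before large files are written into it. It assumes the standard Lustre lfs utility is on your path. The helper name, the directory name big_output, the planned size, and the stripe count of 8 are placeholders chosen for illustration, not site recommendations; the examples in $SAMPLES_HOME/OST_Stripes remain the reference for Harold.

```shell
#!/bin/bash
# Hypothetical helper: succeed when a planned file size (in whole GBytes)
# exceeds the 80-GByte threshold mentioned above.
needs_wider_striping() {
    [ "$1" -gt 80 ]
}

planned_size_gb=200                        # assumed size of the file to be written
target_dir="${WORKDIR:-.}/big_output"      # placeholder directory name

if needs_wider_striping "$planned_size_gb"; then
    if command -v lfs >/dev/null 2>&1; then
        mkdir -p "$target_dir"
        lfs setstripe -c 8 "$target_dir"   # new files created here are striped over 8 OSTs
        lfs getstripe "$target_dir"        # display the directory's default layout
    else
        echo "lfs not found; run this on a system with the Lustre client tools"
    fi
fi
```

Setting the stripe count on the directory, rather than on individual files, means every file later created in it inherits the wider layout. Files that already exist keep the layout they were written with.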

2.6. Peak Performance

Harold is rated at 109.3 peak TFLOPS.

3. Accessing the System

3.1. Kerberos

A Kerberos client kit must be installed on your desktop to enable you to get a Kerberos ticket. Kerberos is a network authentication tool that provides secure communication by using secret cryptographic keys. Only users with a valid HPCMP Kerberos authentication can gain access to Harold. More information about installing Kerberos clients on your desktop can be found at HPC Centers: Kerberos & Authentication.

3.2. Logging In

  • Kerberized SSH
    % ssh
  • Kerberized rlogin and telnet are also allowed.

3.3. File Transfers

With the exception of transfers to the local archive system, file transfers to ARL DSRC systems must be performed using Kerberized versions of the following tools: scp, ftp, sftp, and mpscp.

If you need to transfer files to/from other Kerberized systems within the program, you should use kftp, krcp, sftp, mpscp, or /usr/brl/bin/scp.

If you use scp often, you might want to add an alias in your .cshrc.pers or .profile.pers files in $HOME, as follows:

alias scp /usr/brl/bin/scp # .cshrc.pers - csh/tcsh
alias scp=/usr/brl/bin/scp # .profile.pers - sh/ksh/bash

If you want to use mpscp, you should use the "-S /usr/brl/bin/ssh" option to identify the version of ssh to be used. If you use mpscp often, you may want to add an alias in your .cshrc.pers or .profile.pers files in $HOME, as follows:

alias mpscp "mpscp -S /usr/brl/bin/ssh" # .cshrc.pers - csh/tcsh
alias mpscp="mpscp -S /usr/brl/bin/ssh" # .profile.pers - sh/ksh/bash

4. User Environment

4.1. User Directories

The following user directories are provided for all users on Harold.

4.1.1. Home Directory

When you log on to Harold, you will be placed in your home directory, /usr/people/username. The environment variable $HOME is automatically set for you and refers to this directory. $HOME is visible to both the login and compute nodes and may be used to store small user files, but it has limited capacity and is not backed up; therefore, it should not be used for long-term storage.

4.1.2. Work Directory

The path for your working directory on Harold's scratch file system is /usr/var/tmp/username. The environment variable $WORKDIR is automatically set for you and refers to this directory. $WORKDIR is visible to both the login and compute nodes, and should be used for temporary storage of active data related to your batch jobs.

Note: Although the $WORKDIR environment variable is automatically set for you, the directory itself is not created. You can create your $WORKDIR directory as follows:

mkdir $WORKDIR

The scratch file system provides 495 TBytes of formatted disk space. This space is not backed up, however, and is subject to a purge policy.

REMEMBER: This file system is considered volatile working space. You are responsible for archiving any data you wish to preserve. To prevent your data from being "scrubbed," you should copy files that you want to keep into your /archive directory (see below) for long-term storage.

4.1.3. /archive Directory

In addition to $HOME and $WORKDIR, each user is also given a directory on the /archive file system. This file system is visible to the login nodes (not the compute nodes) and is the preferred location for long-term file storage. All users have an area defined in /archive for their use. This area can be accessed using the $ARCHIVE_HOME environment variable. We recommend that you keep large computational files and more frequently accessed files in the $ARCHIVE_HOME directory. We also recommend that any important files located in $HOME should be copied into $ARCHIVE_HOME as well.

Because the compute nodes are unable to see $ARCHIVE_HOME, you will need to pre-stage your input files to your $WORKDIR from a login node before submitting jobs. After jobs complete, you will need to transfer output files from $WORKDIR to $ARCHIVE_HOME from a login node. This may be done manually or through the transfer queue, which executes serial jobs on login nodes.
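
The pre-staging and post-job transfer steps described above might be scripted along the following lines and run on a login node. This is only a sketch: the function names and the directory layout (an input and an output subdirectory per job) are hypothetical, and it assumes $ARCHIVE_HOME and $WORKDIR are set as described in this guide.

```shell
#!/bin/bash
# Hypothetical helpers for moving data between $ARCHIVE_HOME and $WORKDIR.
# Run them on a login node; compute nodes cannot see $ARCHIVE_HOME.

# Copy the input directory for one job from the archive into scratch space.
stage_in() {
    job=$1
    mkdir -p "$WORKDIR/$job" &&
        cp -r "$ARCHIVE_HOME/$job/input" "$WORKDIR/$job/"
}

# Bundle the job's output directory into one tar file and copy it to the archive.
stage_out() {
    job=$1
    mkdir -p "$ARCHIVE_HOME/$job" &&
        tar -cf "$WORKDIR/$job/output.tar" -C "$WORKDIR/$job" output &&
        cp "$WORKDIR/$job/output.tar" "$ARCHIVE_HOME/$job/"
}
```

With these definitions, "stage_in my_job" would be run before submitting the job and "stage_out my_job" after it completes, where my_job is a placeholder name. Bundling output into a single tar file keeps the number of files sent to the tape-backed archive small.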

4.1.4. Center-Wide File System Directory

The path for your working directory on the Center-Wide file system is /p/cwfs/username. The environment variable $CENTER is automatically set to point to this directory. The main purpose of this area is as a staging area for production system output files that require post-processing using the Utility Server.

Because the compute nodes are unable to see /p/cwfs on Harold, you will need to transfer output files from $WORKDIR to /p/cwfs from a login node. This may be done manually or through the transfer queue, which executes serial jobs on login nodes.

4.2. Shells

The following shells are available on Harold: csh, bash, ksh, tcsh, and sh. To request a change of your default shell, contact the Consolidated Customer Assistance Center (CCAC).

4.3. Environment Variables

A number of environment variables are provided by default on all HPCMP HPC systems. We encourage you to use these variables in your scripts where possible. Doing so will help to simplify your scripts and reduce portability issues if you ever need to run those scripts on other systems.

4.3.1. Login Environment Variables

The following environment variables are common to both the login and batch environments:

Common Environment Variables
Variable Description
$ARCHIVE_HOME Your directory on the archive server
$ARCHIVE_HOST The host name of the archive server
$BC_HOST The generic (not node specific) name of the system.
$CC The currently selected C compiler. This variable is automatically updated when a new compiler environment is loaded.
$CENTER Your directory on the Center-Wide File System (CWFS)
$COST_HOME This variable contains the path to the base directory of the default installation of the Common Open Source Tools (COST) installed on a particular compute platform. (See BC policy FY13-01 for COST details.)
$CSI_HOME The directory containing the following list of heavily used application packages: ABAQUS, Accelrys, ANSYS, CFD++, Cobalt, EnSight, Fluent, GASP, Gaussian, LS-DYNA, MATLAB, and TotalView, formerly known as the Consolidated Software Initiative (CSI) list. Other application software may also be installed here by our staff.
$CXX The currently selected C++ compiler. This variable is automatically updated when a new compiler environment is loaded.
$DAAC_HOME The directory containing the ezViz visualization software
$F77 The currently selected Fortran 77 compiler. This variable is automatically updated when a new compiler environment is loaded.
$F90 The currently selected Fortran 90 compiler. This variable is automatically updated when a new compiler environment is loaded.
$HOME Your home directory on the system
$JAVA_HOME The directory containing the default installation of JAVA
$KRB5_HOME The directory containing the Kerberos utilities
$PET_HOME The directory containing the tools formerly installed and maintained by the PETTT staff. This variable is deprecated and will be removed from the system in the future. Certain tools will be migrated to $COST_HOME, as appropriate.
$PROJECTS_HOME A common directory where group-owned and supported applications and codes may be maintained for use by members of a group. Any project may request a group directory under $PROJECTS_HOME.
$SAMPLES_HOME The Sample Code Repository. This is a collection of sample scripts and codes provided and maintained by our staff to help users learn to write their own scripts. There are a number of ready-to-use scripts for a variety of applications.
$WORKDIR Your work directory on the local temporary file system (i.e., local high-speed disk).

4.3.2. Batch-Only Environment Variables

In addition to the variables listed above, the following variables are automatically set only in your batch environment. That is, your batch scripts will be able to see them when they run. These variables are supplied for your convenience and are intended for use inside your batch scripts.

Batch-Only Environment Variables
Variable Description
$BC_CORES_PER_NODE The number of cores per node for the compute node on which a job is running.
$BC_MEM_PER_NODE The approximate maximum user-accessible memory per node (in integer MBytes) for the compute node on which a job is running.
$BC_MPI_TASKS_ALLOC The number of MPI tasks allocated for a job.
$BC_NODE_ALLOC The number of nodes allocated for a job.
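
For example, a batch script can derive its job geometry from these variables instead of hard-coding values. The sketch below is illustrative only: the function names are hypothetical, my_code.x is a placeholder executable, and the memory estimate assumes MPI tasks are spread evenly over the allocated nodes.

```shell
#!/bin/bash
# Hypothetical helpers that derive job geometry from the batch-only variables.

# Total cores in the allocation: nodes times cores per node.
total_cores() {
    echo $(( BC_NODE_ALLOC * BC_CORES_PER_NODE ))
}

# Approximate memory available to each MPI task, in MBytes, assuming the
# tasks are divided evenly among the allocated nodes.
mem_per_task_mb() {
    tasks_per_node=$(( BC_MPI_TASKS_ALLOC / BC_NODE_ALLOC ))
    echo $(( BC_MEM_PER_NODE / tasks_per_node ))
}

# Inside a batch job these variables are already set, so a launch line
# could look like:
#   mpiexec_mpt -np "$BC_MPI_TASKS_ALLOC" ./my_code.x
```

Because the values come from the batch system, the same script can be resubmitted with a different node count without further edits.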

4.4. Modules

Software modules are a very convenient way to set needed environment variables and include necessary directories in your path so commands for particular applications can be found. We strongly encourage you to use modules. For more information on using modules, see the Modules User Guide.

4.5. Archive Usage

Archive storage is provided through the /archive NFS-mounted file system. All users are automatically provided a directory under this file system. However, it is only accessible from the login nodes. Since space in a user's login home area in /usr/people is limited, all large data files requiring permanent storage should be placed in /archive. Also, it is recommended that all important smaller files in /usr/people for which a user requires long-term access be copied to /archive as well. For more information on using the archive system, see the Archive System User's Guide.

4.6. Login Files

When an account is created on Harold, a default .cshrc and/or .profile file is placed in your home directory. This file contains the default setup for modules, PBS, and other system defaults. We suggest that you put any paths, aliases, or libraries you need to load in a .cshrc.pers or .profile.pers file, as appropriate for your shell. That file should be sourced at the end of your .cshrc and/or .profile file as necessary. For example:

if (-f $HOME/.cshrc.pers) then
    source $HOME/.cshrc.pers
endif

If you need to connect to other Kerberized systems within the program, you should use ktelnet, krlogin, or /usr/brl/bin/ssh. If you use Kerberized ssh often, you may want to add an alias in your .cshrc.pers, or .profile.pers files in $HOME, as follows:

alias ssh /usr/brl/bin/ssh # .cshrc.pers - csh/tcsh
alias ssh=/usr/brl/bin/ssh # .profile.pers - sh/ksh/bash

5. Program Development

5.1. Programming Models

Harold supports three programming models: Message Passing Interface (MPI), SHared-MEMory (SHMEM), and Open Multi-Processing (OpenMP). A hybrid MPI/OpenMP programming model is also supported. MPI and SHMEM are examples of message- or data-passing models, while OpenMP uses only shared memory on a node by spawning threads. The hybrid model combines MPI and OpenMP.

5.1.1. Message Passing Interface (MPI)

Harold has an MPI-1.2 standard library provided by SGI that is tuned for Altix ICE systems. The module for this MPI library is mpi/sgi_mpi-x.xx. Harold also has an MPI-2.0 standard library that is a part of the Intel Software Development suite. The module for this MPI library is mpi/intelmpi-x.x. In addition, the OpenMPI MPI library is available. The module for this MPI library is mpi/openmpi-x.x.

5.1.2. SHared MEMory (SHMEM)

The SGI MPI package supports the SHMEM programming model. These logically shared- and distributed-memory access routines provide high-performance, high-bandwidth communication for use in highly parallelized scalable programs. The SHMEM data-passing library routines are similar to the MPI library routines: they pass data between cooperating parallel processes. The SHMEM data-passing routines can be used in programs that perform computations in separate address spaces and that explicitly pass data to and from different processes in the program.

The SHMEM routines minimize the overhead associated with data-passing requests, maximize bandwidth, and minimize data latency. Data latency is the length of time between a process initiating a transfer of data and that data becoming available for use at its destination.

SHMEM routines support remote data transfer through put operations that transfer data to a different process and get operations that transfer data from a different process. Other supported operations are work-shared broadcast and reduction, barrier synchronization, and atomic memory updates. An atomic memory operation is an atomic read and update operation, such as a fetch and increment, on a remote or local data object. The value read is guaranteed to be the value of the data object just prior to the update. See "man intro_shmem" for details on the SHMEM library.

When creating a SHMEM program on Harold, ensure that the following actions are taken:

  • Make sure that an SGI MPI module is loaded. To check this, run the "module list" command.
  • The source code includes one of the following lines:

    INCLUDE 'mpp/shmem.fh' ## for Fortran, or
    #include <mpp/shmem.h> ## for C

  • The compile command includes an option to reference the SHMEM library.

To compile a SHMEM program, use the following examples:

ifort -lsma -lmpi -o shmem_program shmem_program.f90 ## for Fortran, or
icc -lsma -lmpi -o shmem_program shmem_program.c ## for C

Before running a SHMEM program, you should set the $MPI_DSM_DISTRIBUTE environment variable to "yes", as shown below. This variable optimally assigns processes to cores and improves overall memory use.

setenv MPI_DSM_DISTRIBUTE yes # csh/tcsh
export MPI_DSM_DISTRIBUTE=yes # sh/ksh/bash

To run your program within a batch script, use the mpiexec_mpt command as follows:

mpiexec_mpt -np N shmem_program [user_arguments]

where N is the number of processes being started, with each process utilizing one core. The mpiexec_mpt command launches executables across a set of compute nodes. When each member of the parallel application has exited, mpiexec_mpt exits. For more information about mpiexec_mpt, see the mpiexec_mpt man page.

5.1.3. Open Multi-Processing (OpenMP)

OpenMP is available in Intel's Software Development suite for C, C++ and Fortran. Use the "-openmp" flag.

5.1.4. Hybrid Processing (MPI/OpenMP)

In hybrid processing, all intranode parallelization is accomplished using OpenMP, while all internode parallelization is accomplished using MPI. Typically, there is one MPI task assigned per node, with the number of OpenMP threads assigned to each node set at the number of cores available on the node.
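
In a batch script, this layout might be expressed as follows. This is a sketch under stated assumptions: hybrid_code.x is a placeholder executable, the helper name hybrid_layout is hypothetical, and the batch-only variables $BC_NODE_ALLOC and $BC_CORES_PER_NODE are assumed to be set. OMP_NUM_THREADS is the standard OpenMP environment variable that sets the thread count. The helper only prints the settings; it does not launch anything.

```shell
#!/bin/bash
# Hypothetical helper: print the settings for one MPI task per node with
# one OpenMP thread per core.
hybrid_layout() {
    nodes=$1
    cores_per_node=$2
    echo "OMP_NUM_THREADS=$cores_per_node"
    echo "mpiexec_mpt -np $nodes ./hybrid_code.x"
}

# In a batch job, use the values the system provides. The defaults shown
# here exist only so the sketch also runs outside a job.
hybrid_layout "${BC_NODE_ALLOC:-4}" "${BC_CORES_PER_NODE:-8}"
```

For this layout to take effect, the job would presumably also need to request one MPI process per node, for example with mpiprocs=1 in the PBS select statement; that detail is an assumption and should be checked against the scheduler documentation.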

5.2. Available Compilers

Harold has two compiler suites:

  • Intel
  • GNU

Both the Intel and SGI versions of MPI share a common base set of compilers that are available on both the login and compute nodes:

Common Compiler Commands
Language      Intel    GNU         Serial/Parallel
Fortran 77    ifort    gfortran    Serial/Parallel
Fortran 90    ifort    N/A         Serial/Parallel

The following additional compiler wrapper scripts are available only under Intel MPI and OpenMPI:

Intel MPI and OpenMPI Compiler Wrapper Scripts
MPI C      mpicc     mpicc     Parallel
MPI C++    mpicc     mpicc     Parallel
MPI F77    mpif77    mpif77    Parallel
MPI F90    mpif90    mpif90    Parallel

To select one of these compilers for use, load its associated module. See Relevant Modules (below) for more details.

5.2.1. Intel C, C++, and Fortran Compiler

Intel's latest compiler suite improves performance for large memory and F90 applications over the previous version of this product. Intel's latest Fortran compiler, ifort, includes the code-generation and optimization power of the Intel compiler and the language features of the Compaq Visual Fortran front-end. The standard Intel Fortran compiler tools continue to be available as well. The latest Intel C++ compiler now has full binary mix-and-match operability with gcc 3.2 and greater. The compiler also includes support for the gcc Standard Template Library (libstdc++) and allows precompiled headers for Linux compilation.

Several optimizations and tuning options are available for code developed with all Intel compilers on the Nehalem quad-core processor. For more information see Code Profiling and Optimization. The table below shows some compiler options that may help with optimization.

Useful Intel Compiler Options
Option             Description
-O0                disable optimization
-g                 create symbols for tracing and debugging
-O1                optimize for speed with no loop unrolling and no increase in code size
-O2 or -default    default optimization, optimize for speed with inline intrinsic and loop unrolling
-O3                level -O2 optimization plus memory optimization (allows compiler to alter code)
-ipo               interprocedural optimization, inline functions in separate files, partial inlining, dead code elimination, etc.

The following tables contain examples of serial, MPI, and OpenMP compile commands for C, C++ and Fortran.

Example C Compile Commands
Programming Model    Compile Command
Serial               icc -O3 my_code.c -o my_code.x
SGI MPI              icc -O3 my_code.c -o my_code.x -lmpi
OpenMP               icc -O3 my_code.c -o my_code.x -openmp

Example C++ Compile Commands
Programming Model    Compile Command
Serial               icc -O3 my_code.C -o my_code.x
SGI MPI              icc -O3 my_code.C -o my_code.x -lmpi -lmpi++
OpenMP               icc -O3 my_code.C -o my_code.x -openmp

Example Fortran Compile Commands
Programming Model    Compile Command
Serial               ifort -O3 my_code.f90 -o my_code.x
SGI MPI              ifort -O3 my_code.f90 -o my_code.x -lmpi
OpenMP               ifort -O3 my_code.f90 -o my_code.x -openmp

For more information on the Intel compilers, please consult Intel's Software Documentation Library.

5.2.2. GNU Compiler

The default GNU compilers are good for compiling utility programs, but are probably not appropriate for computationally intensive applications. The primary selling point of using GNU compilers is the compatibility between different architectures. The GNU compilers are available when the compiler/gcc/4.4 module is loaded. Once the module is loaded, they can be executed using the compiler commands in the table above. For GNU compilers, the "-O" flag is the basic optimization setting.

More GNU compiler information can be found in the GNU gcc 4.4.2 manual.

5.3. Relevant Modules

If you compile your own codes, you will need to select which compiler and MPI version you want to use. Load the compiler module first, then load the MPI module, and then compile. When you execute your program, load the same pair of modules.

Harold provides individual modules for each compiler and MPI version. To see the list of currently available modules use the "module avail" command. You can use any of the available MPI versions with each compiler by pairing them together when you load the modules. For example:

module load compiler/intel/11.1 mpi/sgi_mpi/1.26
module load compiler/gcc/4.5 mpi/sgi_mpi/1.26

The table below shows the naming convention used for various modules.

Module Naming Conventions
Module                 Module Name
Intel Compilers        compiler/intel/##.#
SGI MPI Library        mpi/sgi_mpi/#.##
Intel MPI Library      mpi/intelmpi/#.#
OpenMPI MPI Library    mpi/openmpi/#.#

For more information on using modules, see the Modules User Guide.

5.4. Libraries

5.4.1. BLAS

The Basic Linear Algebra Subprogram (BLAS) library is a set of high quality routines for performing basic vector and matrix operations. There are three levels of BLAS operations:

  • BLAS Level 1: vector-vector operations
  • BLAS Level 2: matrix-vector operations
  • BLAS Level 3: matrix-matrix operations

More information on the BLAS library is available in the BLAS documentation.

5.4.2. Intel Math Kernel Library (MKL)

The Intel Math Kernel Library (MKL) is a library of numerical processing functions that have been optimized for math, scientific and engineering applications. The MKL includes the following:

  • LAPACK plus BLAS (Levels 1, 2, 3)
  • Discrete Fourier Transforms (DFTs)
  • Vector Statistical Library functions (VSL)
  • Vector Transcendental Math functions (VML)

The MKL can be loaded into your path using the following command:

module load compiler/intel11.1

Add "-L $MKLPATH -l library_name" to the compilation options on your code to use these libraries. The $MKLPATH environment variable is set when the module is loaded. More information is available in Intel's MKL documentation.

5.4.3. Additional Math Libraries

There is also an extensive set of Math libraries available in the $PET_HOME/MATH directory on Harold. Information about these libraries may be found on the Baseline Configuration Web site at BC policy FY06-01.

5.5. Debuggers

5.5.1. gdb

The GNU Project Debugger (gdb) is a debugger that works similarly to dbx and can be invoked either with a program for execution or a running process id. To use gdb to debug a program, optionally together with the core file left by a failed run, use:

gdb a.out corefile

To debug a process that is currently executing on this node, use:

gdb a.out pid

For more information, see the GDB manual.

5.5.2. idb

The Intel Debugger (idb) is a symbolic debugger that implements a stop and examine model to help locate run-time errors in code. It can also attach to running processes to perform kernel debugging, and it has the ability to manage several processes at once as well as multi-threaded applications. To use idb, the code to be debugged must be compiled and linked with the "-Od", "-Oy" and "-Zi" options. By default, idb begins in dbx mode, but can be run in gdb mode by specifying the "-gdb" option. A graphical version of idb can be invoked using the "-gui" option. For more information, see the Intel Debugger Manual.

Note: You must first load the Intel compiler module to access idb.

5.5.3. TotalView

TotalView is a debugger that supports threads, MPI, OpenMP, C/C++, and Fortran, mixed-language codes, advanced features like on-demand memory leak detection, other heap allocation debugging features, and the Standard Template Library Viewer (STLView). Unique features like dive, a wide variety of breakpoints, the Message Queue Graph/Visualizer, powerful data analysis, and control at the thread level are also available.

Currently on Harold, to display the source code, you must limit your debug job to 1 node (8 cores). Debug jobs using multiple nodes will display only assembler instructions.

Follow these steps to use TotalView on Harold via a UNIX X-Windows interface:

  1. Open a console window to Harold: "ssh harold-l[1-7]", where you select a number from 1 to 7.
  2. Start an 8-core, interactive session:

    qsub -V -I -X -N iHarold -q standard -l walltime=07:00:00 -l select=1:ncpus=8:mpiprocs=8 -l place=scatter:excl -A Your_Project_Id -r n -j oe

    Remember to provide a valid project id in the line above. Wait for a compute node to be given to you. You'll see a new prompt similar to "1 r11i3n12". The numbers after the r,i and n will be different.

  3. Once you get an interactive compute node(s), enter the following line in that compute node's console:

    env | grep -i PBS_ | sed -e 's/^/setenv &/g' -e 's/\=/ /g' > sourceme.csh

    (Watch for the placement of the single ticks, " ' ".)

  4. Start a NEW Harold session with X-forwarding turned on: "ssh -X harold-l[1-7]", where you select a number from 1 to 7.
  5. Once you are logged in to a Harold login node, perform ANOTHER "ssh -X" forwarding to the interactive node given to you in the step above. In this example it is r11i3n12, so on harold-l1 you should type "ssh -X r11i3n12".
  6. To make sure you can open an X-window, type "xclock". A clock should be displayed in your window. It may take a second or two, so be patient. If you do not see a clock, then something went wrong with the X-port forwarding steps above.
  7. In the X-port forward console window (the second console window that is opened), type "source sourceme.csh".
  8. Load the modules that were utilized to compile your code and load the TotalView module.

    module load compiler/intel11.1 mpi/sgi_mpi-1.26 totalview

  9. Now start TotalView: type "totalview" and wait a minute or so for the TotalView windows to pop up.
  10. In the TotalView window named "New Program", click the "Browse" button and select your program executable.
  11. Click the "Parallel" tab and select MPT for SGI-MPI code (or "Intel MPI" or "Open MPI" if your code was compiled with those MPI "flavors").
  12. In the same tab, click the "up" arrow next to "Tasks" until it reads 8. This will allow an 8-MPI-task job. Also, increase the node count from "0" to "1".
  13. Click OK. Your source code should pop up, allowing you to set breakpoints, watchpoints, etc.
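
To see what step 3 writes into sourceme.csh, the sketch below runs one made-up variable through an equivalent filter. It is written for sh/bash; the function name pbs_env_to_csh and the job ID value are illustrative assumptions, not output captured on Harold. Unlike the one-liner in step 3, it replaces only the first "=" on each line, so a value that itself contains "=" is left intact.

```shell
# pbs_env_to_csh (hypothetical helper): read NAME=value lines on stdin and
# print a csh "setenv NAME value" line for every line that mentions PBS_.
pbs_env_to_csh() {
    grep -i 'PBS_' | sed -e 's/^/setenv /' -e 's/=/ /'
}

# Demonstration with placeholder input; the HOME line is filtered out.
printf 'PBS_JOBID=12345.harold\nHOME=/home/user\n' | pbs_env_to_csh
# prints: setenv PBS_JOBID 12345.harold
```

Sourcing a file of such lines in the second console window (step 7) gives that shell the same PBS_ variables as the interactive job.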

If you are using Cygwin, please log onto Harold and cd to /usr/cta/SCR/MPI_Totalview to view the appropriate document for your system.

For more information on using TotalView, see the TotalView Documentation page.

5.6. Code Profiling and Optimization

Profiling is the process of analyzing the execution flow and characteristics of your program to identify sections of code that are likely candidates for optimization. Optimization increases the performance of a program by modifying certain aspects of it for greater efficiency.

We provide two profiling tools, gprof and codecov, to assist you in the profiling process. A basic overview of optimization methods with information about how they may improve the performance of your code can be found in Performance Optimization Methods (below).

5.6.1. gprof

The GNU Project Profiler (gprof) is a profiler that shows how your program is spending its time and which function calls are made. To profile code using gprof, use the "-pg" option during compilation. For more information, see the gprof manual.
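
A typical gprof session has three stages: build with -pg, run the program once so that it writes gmon.out, then hand both files to gprof. The sh/bash sketch below wraps those stages in a hypothetical helper, profile_with_gprof; the default compiler (gcc, overridable through $CC) and the file names are assumptions for illustration.

```shell
# profile_with_gprof SRC [EXE] (hypothetical helper)
# Builds SRC with -pg, runs it once, and prints the gprof report.
profile_with_gprof() {
    src=$1
    exe=${2:-a.out}
    if [ ! -f "$src" ]; then
        echo "profile_with_gprof: no such source file: $src" >&2
        return 1
    fi
    # -pg instruments the code; it is needed when compiling and when linking.
    "${CC:-gcc}" -pg -O2 -o "$exe" "$src" || return 1
    # A normal exit of the instrumented program writes gmon.out
    # in the current directory.
    "./$exe" || return 1
    # Combine the executable's symbols with the recorded samples.
    gprof "./$exe" gmon.out
}

# Example call (commented out so the sketch has no side effects):
# profile_with_gprof mycode.c mycode > profile.txt
```

The report lists a flat profile (time per function) followed by a call graph, which together point to the routines worth optimizing first.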

5.6.2. codecov

The Intel Code Coverage Tool (codecov) can be used in numerous ways to improve code efficiency and increase application performance. The tool leverages profile-guided optimization technology (discussed below). Coverage can be specified in the tool as file-level, function-level, or block-level. Another benefit of this tool is the ability to compare the profiles of two application runs to find where the optimizations are making a difference. More detailed information on this tool can be found in the Intel documentation for codecov.

5.6.3. Program Development Reminders

If an application is not programmed for distributed memory, then only the cores on a single node can be used. This is limited to 8 cores on Harold.

Check the utilization of the nodes your application is running on to see if it is taking advantage of all the resources available to it. This can be done by finding the nodes assigned to your job by executing "qstat -f", logging into one of the nodes using the ssh command, and then executing the top command to see how many copies of your executable are being executed on the node.
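
The node list can be pulled out of the exec_host field that "qstat -f" reports. The sh/bash sketch below assumes the common PBS layout in which hosts are joined by "+" and each host is followed by "/" and a slot description; the helper name hosts_from_exec_host and the sample string are illustrative, so compare against the actual output on Harold before relying on it.

```shell
# hosts_from_exec_host STRING (hypothetical helper)
# Prints each distinct host name found in an exec_host value, one per line.
hosts_from_exec_host() {
    printf '%s\n' "$1" | tr '+' '\n' | sed 's|/.*$||' | sort -u
}

# Demonstration with a made-up two-node value.
hosts_from_exec_host 'r11i3n12/0*8+r11i3n13/0*8'

# On a live system the hosts could then be inspected one at a time, e.g.:
# for h in $(hosts_from_exec_host "$exec_host_value"); do
#     ssh "$h" top -b -n 1 | head -20
# done
```

On a fully used Harold node, top should show eight copies of your executable, one per core.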

Keep the system architecture in mind during code development. For instance, if your program requires more memory than is available on a single node, then you will need to parallelize your code so that it can function across multiple nodes.

5.6.4. Performance Optimization Methods

Optimization generally increases compilation time and executable size, and may make debugging difficult. However, it usually produces code that runs significantly faster. The optimizations that you can use will vary depending on your code and the system on which you are running.

Note: Before considering optimization, you should always ensure that your code runs correctly and produces valid output.

In general, there are five main categories of optimization:

  • Global Optimization
  • Loop Optimization
  • Interprocedural Analysis and Optimization (IPA)
  • Function Inlining
  • Profile-Guided Optimizations

Global Optimization

A technique that looks at the program as a whole and may perform any of the following actions:

  • Performed on code over all its basic blocks
  • Performs control-flow and data-flow analysis for an entire program
  • Detects all loops, including those formed by IF and GOTO statements, and performs general optimization.
  • Constant propagation
  • Copy propagation
  • Dead store elimination
  • Global register allocation
  • Invariant code motion
  • Induction variable elimination

Loop Optimization

A technique that focuses on loops (for, while, etc.) in your code and looks for ways to reduce loop iterations or parallelize the loop operations. The following types of actions may be performed:

  • Vectorization - rewrites loops to improve memory access performance. With the Intel compilers, loops can be automatically converted to utilize the MMX/SSE/SSE2/SSE3 instructions and registers if they meet certain criteria.
  • Loop unrolling - (also known as "unwinding") replicates the body of loops to reduce loop branching overhead and provide better opportunities for local optimization.
  • Parallelization - divides loop operations over multiple processors where possible.

Interprocedural Analysis and Optimization (IPA)

A technique that allows the use of information across function call boundaries to perform optimizations that would otherwise be unavailable.

Function Inlining

A technique that seeks to reduce function call and return overhead.

  • Used with functions that are called numerous times from relatively few locations.
  • Allows a function call to be replaced by a copy of the body of that function.
  • May create opportunities for other types of optimization
  • May not be beneficial. Improper use may increase code size and actually result in less efficient code.

Profile-Guided Optimizations

Profile-guided optimizations allow the compiler to make data-driven decisions during compilation about branch prediction, increased parallelism, block ordering, register allocation, function ordering, and more. Building with this option takes three steps and uses a representative data set to determine the optimizations.

For example:

  • Step 1: Instrumentation, Compilation, and Linking

    ifort -prof-gen -prof-dir ${HOME}/profdata -O2 -c a1.f a2.f a3.f
    ifort -o a1 a1.o a2.o a3.o

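  • Step 2: Instrumentation Execution

    Run the instrumented executable built in Step 1 (a1 in this example) on a representative data set to generate the profile data.
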
  • Step 3: Feedback Compilation

    ifort -prof-use -prof-dir ${HOME}/profdata -ipo a1.f a2.f a3.f

6. Batch Scheduling

6.1. Scheduler

The Portable Batch System (PBS) is currently running on Harold. It schedules jobs and manages resources and job queues, and can be accessed through the interactive batch environment or by submitting a batch request. PBS is able to manage both single-processor and multiprocessor jobs. The PBS module is automatically loaded by the Master module on Harold at login.

6.2. Queue Information

The following table describes the PBS queues available on Harold:

Summary of Queues on Harold (listed in order of decreasing priority)

Priority  Queue          Job Class   Max Wall Clock Time  Max Cores Per Job  Comments
--------  -------------  ----------  -------------------  -----------------  --------
Highest   debug          Debug       1 Hour               N/A                User diagnostic jobs
          transfer       N/A         24 Hours             1                  Data transfer for user jobs
          urgent         Urgent      96 Hours             N/A                Designated urgent jobs by DoD HPCMP
          staff          N/A         368 Hours            N/A                ARL DSRC staff testing only. System testing and user support.
          high           High        96 Hours             N/A                Designated high-priority jobs by DoD HPCMP
          challenge      Challenge   168 Hours            N/A                Challenge projects only
          cots           Standard    96 Hours             N/A                Abaqus, Fluent, and Cobalt jobs
          interactive    Standard    12 Hours             N/A                Interactive jobs
          standard-long  Standard    200 Hours            N/A                ARL DSRC permission required
          standard       Standard    96 Hours             N/A                Non-Challenge user jobs
Lowest    background     Background  24 Hours             N/A                User jobs that will not be charged against the project allocation.

6.3. Interactive Logins

When you log in to Harold, you will be running in an interactive shell on a login node. The login nodes provide login access for Harold and support such activities as compiling, editing, and general interactive use by all users. Please note the Login Node Abuse policy. The preferred method for running resource-intensive programs is to use an interactive batch session.

6.4. Interactive Batch Sessions

You can start an interactive session on a compute node by issuing the appropriate PBS command from a login node. Once PBS has scheduled your request on the compute pool, you will be directly logged into a compute node, and this session can last as long as your requested wall time.

To submit an interactive batch job, use the following submission format:

qsub -I -l walltime=HH:MM:SS -l select=#_of_nodes:ncpus=8:mpiprocs=8 \
    -l place=scatter:excl -A proj_id -q interactive -V

Your batch shell request will be placed in the interactive queue and scheduled for execution. This may take anywhere from a few minutes to much longer, depending on the system load. Once your shell starts, you will be logged into the first of the compute nodes assigned to your interactive batch job. At this point, you can run or debug applications interactively, execute job scripts, or start executions on the compute nodes you were assigned.
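
As a concrete instance of the submission format above, the sh/bash sketch below fills in the placeholders and prints the resulting command instead of running it, which makes it easy to check before submitting. The helper name interactive_qsub_cmd and the example values (2 nodes, one hour, Your_Project_Id) are assumptions for illustration.

```shell
# interactive_qsub_cmd NODES WALLTIME PROJECT_ID (hypothetical helper)
# Prints, but does not run, an interactive qsub command for whole 8-core nodes.
interactive_qsub_cmd() {
    nodes=$1
    walltime=$2
    proj=$3
    printf 'qsub -I -l walltime=%s -l select=%s:ncpus=8:mpiprocs=8 -l place=scatter:excl -A %s -q interactive -V\n' \
        "$walltime" "$nodes" "$proj"
}

interactive_qsub_cmd 2 01:00:00 Your_Project_Id
# prints: qsub -I -l walltime=01:00:00 -l select=2:ncpus=8:mpiprocs=8 -l place=scatter:excl -A Your_Project_Id -q interactive -V
```

Keep the requested wall time within the 12-hour limit of the interactive queue.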

6.5. Batch Request Submission

PBS batch jobs are submitted via the qsub command. The format of this command is:

qsub [ options ] batch_script_file

qsub options may be specified on the command line or embedded in the batch script file by lines beginning with "#PBS".

For a more thorough discussion of PBS Batch Submission, see the Harold PBS Guide.

6.6. Batch Resource Directives

A listing of the most common batch Resource Directives is available in the Harold PBS Guide.

6.7. Launch Commands

There are different commands for launching MPI executables from within a batch job depending on which MPI implementation your script uses.

To launch an SGI MPI executable, use the mpiexec_mpt command as follows:

mpiexec_mpt -np #_of_cores ./mympijob.exe
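
Rather than hard-coding #_of_cores, a job script can count the entries in the node file that PBS provides through $PBS_NODEFILE, which normally holds one line per requested MPI process. The sh/bash sketch below shows the counting step on a stand-in file; the helper name count_ranks and the sample host names are assumptions.

```shell
# count_ranks FILE (hypothetical helper): print the number of lines in FILE.
count_ranks() {
    awk 'END { print NR }' "$1"
}

# Stand-in for $PBS_NODEFILE: two nodes with two MPI processes each.
nodefile=$(mktemp)
printf 'r11i3n12\nr11i3n12\nr11i3n13\nr11i3n13\n' > "$nodefile"
count_ranks "$nodefile"
# prints: 4
rm -f "$nodefile"

# Inside a real job the launch line might then read:
# mpiexec_mpt -np "$(count_ranks "$PBS_NODEFILE")" ./mympijob.exe
```

Deriving the count this way keeps the launch line consistent with the select and mpiprocs values whenever the resource request changes.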

To launch an Intel MPI executable, use the intelmpirun.pbs command as follows:

intelmpirun.pbs ./mympijob.exe

To launch an Open MPI executable, use the openmpirun.pbs command as follows:

openmpirun.pbs ./mympijob.exe

For OpenMP executables, no launch command is needed.

6.8. Sample Scripts

The following script is a basic example. More thorough examples are available in the Harold PBS Guide and in the Sample Code Repository ($SAMPLES_HOME) on Harold.

#  Specify job name.
#PBS -N myjob

#  Specify queue name.
#PBS -q standard

# select = # of nodes
# ncpus is ALWAYS set to 8!
# mpiprocs is the number of cores on each node to use
# This run will use (select)x(mpiprocs) cores = 8*8=64 cores
#PBS -l select=8:ncpus=8:mpiprocs=8

#  Specify how MPI processes should be distributed across nodes.
#PBS -l place=scatter:excl

#  Specify maximum wall clock time.
#PBS -l walltime=24:00:00

#  Specify Project ID to use. ID may have the form ARLAP96090RAY.
#PBS -A Your_Project_Id

#  Specify that environment variables should be passed to master MPI process.
#PBS -V

#  Strip everything after the first "." from the PBS job ID.
set JOBID=`echo $PBS_JOBID | cut -f1 -d.`

#  Create a temporary working directory within $WORKDIR for this job run.
set TMPD=${WORKDIR}/${JOBID}
mkdir -p $TMPD

# Change directory to submit directory
# and copy executable and input file to scratch space
cd $PBS_O_WORKDIR
cp mpicode.x $TMPD
cp input.dat $TMPD

cd $TMPD

# Keep only ONE of the three blocks below: the one that matches the
#   MPI implementation your executable was built with.

# The following two lines provide an example of
#   setting up and running an SGI MPI parallel code.
module load compiler/intel/11.1 mpi/sgi_mpi/1.26
mpiexec_mpt -n 64 ./mpicode.x > out.dat

# The following two lines provide an example of
#   setting up and running an Intel MPI parallel code.
module load compiler/intel/11.1 mpi/intelmpi/3.2
intelmpirun.pbs ./mpicode.x > out.dat

# The following two lines provide an example of
#   setting up and running an Open MPI parallel code.
module load compiler/intel/11.1 mpi/openmpi/4.0
openmpirun.pbs ./mpicode.x > out.dat

cp out.dat $PBS_O_WORKDIR
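
Because the script requests whole 8-core nodes (ncpus=8 with place=scatter:excl), the select value is the core count divided by 8, rounded up. The sh/bash sketch below works through that arithmetic; the helper name nodes_for_cores is an assumption, and the default of 8 cores per node comes from the ncpus=8 setting used throughout this guide.

```shell
# nodes_for_cores CORES [CORES_PER_NODE] (hypothetical helper)
# Prints the number of whole nodes needed, using ceiling division.
nodes_for_cores() {
    cores=$1
    per_node=${2:-8}
    echo $(( (cores + per_node - 1) / per_node ))
}

nodes_for_cores 64   # prints: 8  (matches select=8 in the script above)
nodes_for_cores 20   # prints: 3  (the last node is only partly used)
```

When the core count is not a multiple of 8, lowering mpiprocs or adjusting the problem decomposition avoids leaving cores idle on the last node.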

6.9. PBS Commands

The following commands provide the basic functionality for using the PBS batch system:

qsub: Used to submit jobs for batch processing.
qsub [ options ] my_job_script

qstat: Used to check the status of submitted jobs.
qstat PBS_JOBID ## check one job
qstat -u my_user_name ## check all of user's jobs

qdel: Used to kill queued or running jobs.
qdel PBS_JOBID

A more complete list of PBS commands is available in the Harold PBS Guide.

6.10. Advance Reservations

An Advance Reservation Service (ARS) is available on Harold for reserving cores for use, starting at a specific date/time, and lasting for a specific number of hours. The specific number of reservable cores changes frequently, but is displayed on the reservation page for each system in the ARS. The ARS is accessible via most modern web browsers. Authenticated access is required. An ARS User's Guide is available online once you have logged in.

7. Software Resources

7.1. Application Software

All Commercial Off The Shelf (COTS) software packages can be found in the $CSI_HOME (/usr/cta) directory. A complete listing of software on Harold with installed versions can be found on our software page. The general rule for all COTS software packages is that the two latest versions will be maintained on our systems. For convenience, modules are also available for most COTS software packages.

7.2. Useful Utilities

The following utilities are available on Harold:

Useful Utilities

check_license
    Checks the status of ten HPCMP shared applications grouped into two distinct categories: Software License Buffer (SLB) applications and non-SLB applications.
    Usage: check_license package

node_use
    Displays memory-use and load-average information for all login nodes of the system on which it is executed.
    Usage: node_use -a

qpeek
    Returns the standard output (STDOUT) and standard error (STDERR) messages for any submitted PBS job from the start of execution.
    Usage: qpeek PBS_JOB_ID

qview
    Lists the status and current usage of all PBS queues on Harold.
    Usage: "qview -h" shows all the qview options available.

show_queues
    Lists the status and current usage of all PBS queues on Harold.
    Usage: show_queues

show_storage
    Provides quota and usage information for the storage areas in which the user owns data on the current system.
    Usage: show_storage

show_usage
    Lists the project ID and total hours allocated / used in the current FY for each project you have on Harold.
    Usage: show_usage

show_user_pbs
    Shows a synopsis of your currently submitted PBS jobs, including: jobs/cores pending, jobs/cores running, jobs/cores held.
    Usage: show_user_pbs [user_name] (if user_name is omitted, the current user_name is used)

7.3. Sample Code Repository

The Sample Code Repository is a directory that contains examples for COTS batch scripts, building and using serial and parallel programs, data management, and accessing and using serial and parallel math libraries. The $SAMPLES_HOME environment variable contains the path to this area, and is automatically defined in your login environment. Below is a listing of the examples provided in the Sample Code Repository on Harold.

Sample Code Repository on Harold

Application-specific examples; interactive job submit scripts; use of the application name resource; software license use.

  abaqus - Basic batch script and input deck for an Abaqus application.
  abinit - Basic batch script for an ABINIT application.
  adf - Basic batch script and input deck for an ADF application.
  ale3d - Basic batch script and input deck for an ALE3D application.
  amber - Basic batch script for an AMBER9 application.
  ansys - Basic batch script for an ANSYS application.
  autodyn - Basic batch script for the autodyn tool from ANSYS.
  castep - Basic batch script and input deck for a CASTEP application.
  cfd++ - Basic batch script and input deck for a CFD++ application.
  cfx5solve - Basic batch script and input deck for a CFX Solver application.
  cobalt - Basic batch script and input deck for a COBALT application.
  comsol - Basic batch script and input deck for a COMSOL application.
  cth - Basic batch script and input deck for a CTH application.
  discover - Basic batch script and input deck for a DISCOVER application.
  dmol3 - Basic batch script and input deck for a DMOL3 application.
  epic - Basic batch script for an EPIC application.
  espresso - Basic batch script for an ESPRESSO application.
  fluent - Basic batch script and input deck for a FLUENT (now ACFD) application.
  gamess - Basic batch script and input deck for a GAMESS application.
  GAMESS - auto_submit script and input deck for a GAMESS application.
  gasp - Basic batch script for a GASP application.
  gaussian - Input deck for a GAUSSIAN application and automatic submission script for submitting a Gaussian job.
  gromacs - Basic batch script for a GROMACS application.
  gulp - Basic batch script for a GULP application.
  lammps - Basic batch script and input deck for a LAMMPS application.
  ls-dyna - Basic batch script and input deck for an LS-DYNA application.
  matlab - Basic batch script and sample m file for a MATLAB application.
  mcnpx - Basic batch script for an MCNPX application.
  mesodyn - Basic batch script for a MESODYN application.
  mesotek - Basic batch script for a MESOTEK application.
  molpro - Basic batch script and input file for a MOLPRO application.
  namd - Basic batch script for a NAMD application.
  nwchem - Basic batch script and input file for an NWCHEM application.
  OPENFOAM - Basic batch script for an OPENFOAM application.
  overflow - Basic batch script for an OVERFLOW application.
  STARCCM+ - Basic batch script and input deck for a STARCCM+ application.
  xpatch - Basic batch script and input deck for an XPATCH application.

Archiving and retrieving files; Lustre striping; file searching; $WORKDIR use.

  OST_Stripes - Instructions and examples for striping large files on the Lustre file systems.
  pre_post_Example - Sample batch script showing how to stage data out after a job executes.
  Transfer_Queue_Example - Sample batch script on using the transfer queue.

MPI, OpenMP, and hybrid examples; single-core jobs; large memory jobs; running multiple applications within a single batch job.

  Hybrid - Simple MPI/OpenMP hybrid example and script.
  MPI_examples - Simple MPI examples and batch scripts for SGI MPT, Intel MPI and OpenMPI.
  OpenMP - Simple OpenMP example and batch script.
  Serial_run - Simple batch script to run a single core job.

Basic code compilation; debugging; use of library files; static vs. dynamic linking; Makefiles; Endian conversion.

  BLACS_Example - Sample ScaLAPACK Fortran program, compile script and PBS submission scripts.
  Endian_Conversion - Instructions on how to manage data created on a machine with different Endian format.
  MPI_ddt - Instructions on how to use the DDT debugger to debug MPI code.
  MPI_Compilation - Instructions and sample scripts for using the versions of MPI
  MPI_Examples - Instructions on how to build parallel codes with each compiler/MPI suite combination available on the system.
  MPI_Totalview - Instructions on how to use the TotalView debugger to debug MPI code.
  ScaLAPACK_Example - Sample ScaLAPACK Fortran program, compile script and PBS submission scripts.
  Serial_Totalview - Instructions on how to use the TotalView debugger to debug serial code.
  Timers_Fortran - Serial Timers using Fortran Intrinsics f77 and f90/95.

Use of modules; customizing the login environment.

  Module_Swap_Example - Instructions for how to use the module swap command.

Basic batch scripting; use of the transfer queue; job arrays; job dependencies; Secure Remote Desktop; job monitoring.

  BatchScript_Example - Sample basic PBS batch script.
  Hybrid_Example - Simple MPI/OpenMP hybrid example and script.
  Job_Array_Example - Instructions and example job script for using job arrays.
  MPI_Example - Sample script for running MPI jobs.
  OpenMP_Example - Sample script for running OpenMP jobs.
  Serial_Example - Sample script for running multiple sequential jobs.
  Transfer_Queue - PBS batch script example for data transfer.

8. Links to Vendor Documentation

SGI Home:
SGI Altix ICE:

Novell Home:
Novell SUSE Linux Enterprise Server:

GNU Home:
GNU Compiler:

Intel Home:
Intel Xeon 5500 Processor:
Intel Software Documentation Library:

Linux High Performance Technical Computing: