Running Jobs on the NCSA Intel 64 Linux Cluster

 
  1. Interactive Use
  2. Running Programs
  3. Batch System (Torque)
    1. Scheduling Policies
    2. Queues
    3. Batch Commands
      1. qsub
      2. qsub -I
      3. qstat
      4. qhist
      5. qdel
    4. Sample Batch Scripts
    5. Disk Space for Batch Jobs
  4. Notes

1. Interactive Use

Do not run jobs on the interactive nodes; they are intended primarily for compiling and building your programs. Run jobs on the compute nodes instead. See the section on qsub -I for instructions on how to run an interactive job on the compute nodes.

2. Running Programs

MPI

All the implementations of MPI on the NCSA Intel 64 Linux Cluster have the mpirun script for running an MPI program. See the sample batch scripts for syntax details for the MPI implementations.

Notes:

  • The environment variable $PBS_NODEFILE is automatically defined in a batch job to point to a temporary file that contains the list of nodes assigned to the job.
  • The arguments to mpirun need to come before your executable. Any arguments after your executable are considered to be arguments to your executable.
  • The VMI2 MPI implementation does not propagate environment variables well. The workaround is to create a wrapper script that sets all the environment variables that your code will need along with the executable. Then in your batch script use the wrapper script as the executable in your mpirun line.
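
The wrapper workaround for VMI2 might look like the following minimal sketch. The file name run_wrapper.sh, the choice of OMP_NUM_THREADS as the variable to set, and the executable name ./a.out are illustrative assumptions, not names used by the cluster. The wrapper is written for the Bourne shell and carries its own #! line.

```sh
# Create a hypothetical wrapper script, run_wrapper.sh, that sets the
# environment and then starts the real executable.
cat > run_wrapper.sh <<'EOF'
#!/bin/sh
# Set every variable the program needs (OMP_NUM_THREADS is only an example).
export OMP_NUM_THREADS=8
# Replace this shell with the real program, passing along any arguments.
exec ./a.out "$@"
EOF
chmod +x run_wrapper.sh
```

In the batch script, ./run_wrapper.sh would then take the place of ./a.out on the mpirun line.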

OpenMP

Before you run an OpenMP program, set the environment variable OMP_NUM_THREADS to the number of threads you want. For example, to run program a.out interactively with two threads:

  setenv OMP_NUM_THREADS 2
  ./a.out

The following environment variables may also be useful in running your OpenMP programs:

  • OMP_SCHEDULE: sets the schedule type and (optionally) the chunk size for DO and PARALLEL DO loops declared with a schedule of RUNTIME. The default is STATIC.
  • KMP_LIBRARY: sets the run-time execution mode. The default is throughput, but it can be set to turnaround so worker threads do not yield while waiting for work.
  • KMP_STACKSIZE: sets the number of bytes to allocate for the stack of each parallel thread. You can use a suffix k, m, or g to specify kilobytes, megabytes or gigabytes. The default is 4m.
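
As a sketch, the three variables could be set together before launching the program. The values below are arbitrary illustrations, not recommendations. The block uses Bourne-shell export syntax; under csh or tcsh the equivalent form is setenv NAME value, as in the OMP_NUM_THREADS example above.

```sh
# Illustrative values only; tune them for your own program.
export OMP_SCHEDULE="dynamic,4"   # schedule type "dynamic" with a chunk size of 4
export KMP_LIBRARY=turnaround     # worker threads do not yield while waiting
export KMP_STACKSIZE=16m          # 16 megabytes of stack per parallel thread
```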

Hybrid MPI/OpenMP

To run an MPI/OpenMP hybrid program, set the environment variable OMP_NUM_THREADS to the number of threads you want, and adjust the number of cores per node (ppn) used for MPI accordingly. For example, to run a program with 10 MPI ranks and 8 threads per rank, put the following in your batch script:

  #PBS -l nodes=10:ppn=1
  setenv OMP_NUM_THREADS 8

If you use VMI2, see the note in the MPI section above about using a wrapper script to set environment variables.

(See the qsub section for information on PBS directives.)
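
Inside a batch job, the number of MPI ranks can be derived from $PBS_NODEFILE rather than hard-coded. The sketch below assumes that the node file holds one line per requested core (so nodes=10:ppn=1 would yield 10 lines), which is the usual Torque behavior but is not stated on this page. The helper names count_ranks and count_nodes are hypothetical.

```sh
# count_ranks FILE: number of lines in the node file (one per MPI rank).
count_ranks() {
    wc -l < "$1" | tr -d ' '
}

# count_nodes FILE: number of distinct host names in the node file.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Inside a batch job (PBS_NODEFILE is only defined there):
if [ -n "${PBS_NODEFILE:-}" ]; then
    export OMP_NUM_THREADS=8
    echo "ranks: $(count_ranks "$PBS_NODEFILE"), nodes: $(count_nodes "$PBS_NODEFILE")"
fi
```

The resulting rank count can be passed to mpirun in whatever form the chosen MPI implementation expects; the sample batch scripts show the exact syntax.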

3. Batch System (Torque)

The NCSA Intel 64 Linux Cluster uses the Torque Resource Manager with the Moab Workload Manager for running jobs. Torque is based upon OpenPBS, so the commands are the same as PBS commands.

3.1 Scheduling Policies

As per the resource use guidelines, the scheduling policy on Abe is set to highly favor large node-count jobs.

Also, as with other HPC systems at NCSA, the scheduling policy includes fair-share. Under this policy, a job's priority may be increased or decreased because of other jobs that the user is running or has recently run. In order to give everyone a fair opportunity to run jobs, a user's job has a higher priority if that user has not run jobs in the recent past. Fair-share also factors in the ratio of the service units allocated to the user's project to the time remaining until the allocation expires.

To maximize utilization, the scheduler will also back-fill jobs. When trying to schedule large blocks of nodes for large jobs, there are often "holes" where some nodes are idle waiting to be added to a pool to start a large waiting job. The scheduler back-fills smaller jobs into these holes.

When figuring out a job's priority relative to other jobs, there are several factors which are taken into account. Some of these factors include:

  • job size (how many nodes)
  • job expansion factor (the ratio of the time the job has spent eligible to be run to the time the job has requested)
  • the raw amount of time the job has spent eligible to be run
  • fair-share factors

A relative weighting of these factors contributes to a job's priority.

A debug queue is available to facilitate fast turnaround on debugging/testing jobs. Jobs in this queue have an intrinsically higher priority; additionally, they accrue priority at a much higher rate because the expansion factor (and its associated priority factor) increases very quickly.
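
To see why a short debug job gains priority quickly, the expansion factor can be computed directly from the definition given above (eligible time divided by requested time). This only illustrates that ratio: the exact formula and weights used by the scheduler are not given here, and the function name expansion_factor is hypothetical.

```sh
# expansion_factor ELIGIBLE_MIN REQUESTED_MIN
# Prints eligible time divided by requested time, to two decimal places.
expansion_factor() {
    awk -v e="$1" -v r="$2" 'BEGIN { printf "%.2f\n", e / r }'
}

# After one hour of eligibility:
expansion_factor 60 30      # 30-minute debug job  -> 2.00
expansion_factor 60 2880    # 48-hour normal job   -> 0.02
```

For the same waiting time, the job that requested 30 minutes has a ratio about 100 times larger than the job that requested 48 hours.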

3.2 Queues

The following queues are currently available for users:

Queue             Walltime    Max # Nodes
debug             30 mins     16
normal (default)  48 hours    600
wide (new)        48 hours    1196
long (*)          168 hours   600

(*) Access to the long queue is available by request. Please send email to consult@ncsa.uiuc.edu with a justification of need.

3.3 Batch Commands

Below are brief descriptions of the useful batch commands. For more detailed information, refer to the individual man pages.

3.3.1 qsub

The qsub command is used to submit a batch job to a queue. All options to qsub can be specified either on the command line or as a line in a script (known as an embedded option). Command line options have precedence over embedded options. Scripts can be submitted using

qsub [list of qsub options] script_name

The main qsub options are listed below. The sample batch scripts illustrate qsub usage and options. Also see the qsub man page for other options.

  • -l resource_list: specifies resource limits. The resource_list argument is of the form:
    resource_name[=[value]][,resource_name[=[value]],...]:resource
    

    The resource_names are:

    walltime: maximum wall clock time (hh:mm:ss) [default: 10 mins]
    nodes: number of 8-core nodes [default: 1 node]
    ppn: how many cores per node to use (1 through 8) [default: ppn=1]
    resource: resource to be used. The available resource is himem, which gives access to the 16 GB memory nodes.
    Note: Specify the himem resource only if you absolutely need the higher memory nodes since it can impact turnaround time of the job.

    Examples:
    #PBS -l walltime=00:30:00,nodes=2:ppn=8
    #PBS -l walltime=00:30:00,nodes=2:ppn=8:himem
    

  • -q queue_name: specifies the queue name. [default: normal]

  • -N jobname: specifies the job name.

  • -o out_file: stores the standard output of the job in the file out_file. After the job is done, this file will be found in the directory from which the qsub command was issued. [default: <jobname>.o<PBS_JOBID>]

  • -e err_file: stores the standard error of the job in the file err_file. After the job is done, this file will be found in the directory from which the qsub command was issued. [default: <jobname>.e<PBS_JOBID>]

  • -j oe: merges standard output and standard error into the standard output file.

  • -V: exports all your environment variables to the batch job.

  • -m be: sends mail at the beginning and end of a job.

  • -M myemail@myuniv.edu: sends job mail to the given email address.

Notes:

  • Using the -N option without the -o and -e options generates stdout and stderr files named <jobname>.o<jobid> and <jobname>.e<jobid>, respectively, in the directory from which the batch job was submitted.
  • Temporary stdout/stderr files while the job is running are located in the home directory [$HOME/.pbs_spool or $HOME], and named <jobid>.abem5.OU and <jobid>.abem5.ER.
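
Put together, the embedded options at the top of a batch script might look like the sketch below. The job name, mail address, queue, and resource values are placeholders chosen for illustration. The lines beginning with #PBS are read by qsub and treated as comments by the shell.

```sh
#PBS -l walltime=00:30:00,nodes=2:ppn=8
#PBS -q debug
#PBS -N myjob
#PBS -j oe
#PBS -m be
#PBS -M myemail@myuniv.edu

# Commands to run the program go here; see the sample batch scripts.
```

With these options, the merged standard output and standard error would end up in myjob.o<jobid> in the directory from which the job was submitted.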

3.3.2 qsub -I

The -I option tells qsub you want to run an interactive job. You can also use other qsub options such as those documented in the batch sample scripts. For example, the following command:

   qsub -I -V -l walltime=00:30:00,nodes=2:ppn=8

will run an interactive job with a wall clock limit of 30 minutes, using two nodes and eight cores per node.

After you enter the command, you will have to wait for Torque to start the job. As with any job, your interactive job will wait in the queue until the specified number of nodes is available. If you specify a small number of nodes for smaller amounts of time, the wait should be shorter because your job will backfill among larger jobs. Once the job starts, you will see something like this:

qsub: waiting for job 1244.abem5.ncsa.uiuc.edu to start
qsub: job 1244.abem5.ncsa.uiuc.edu ready

Now you are logged into the launch node. At this point, you can use the appropriate command to start your program.

When you are done with your runs, you can use the exit command to end the job.

3.3.3 qstat

The qstat command displays the status of batch jobs.
  • qstat -a gives the status of all jobs on the system.
  • qstat -n lists nodes allocated to a running job in addition to basic information.
  • qstat -f PBS_JOBID gives detailed information on a particular job.
    Note: Currently PBS_JOBID must be given in its full form: <jobid>.abem5.ncsa.uiuc.edu.
  • qstat -q provides summary information on all the queues.

See the man page for other options available.
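
Because qstat -f currently needs the full job identifier, a small helper can append the server suffix to a numeric job ID. This is a sketch: the function name full_jobid is hypothetical, and the suffix is the one shown in the note above.

```sh
# full_jobid NUMERIC_ID: print the ID with the server suffix appended.
full_jobid() {
    printf '%s.abem5.ncsa.uiuc.edu\n' "$1"
}

# Example use on the cluster:
#   qstat -f "$(full_jobid 1244)"
```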

3.3.4 qhist

qhist, a locally written tool available on the NCSA Intel 64 Linux Cluster, summarizes the raw accounting record(s) for one or more jobs. See the output of "qhist --help" for details.

To display information about a specific job, the syntax is qhist PBS_JOBID.

3.3.5 qdel

The qdel command deletes a queued job or kills a running job. The syntax is qdel PBS_JOBID.

Note: You only need to use the numeric part of the Job ID.

3.4 Sample Batch Scripts

Sample batch scripts are available in the directory /usr/local/doc/batch_scripts for use as templates.

3.5 Disk Space for Batch Jobs

Scratch space for batch jobs is provided via a per-job scratch directory that is created at the beginning of the job. This directory is created under /scratch/batch, and its name is based on the JobID. If the batch script uses one of the sample scripts as a template, the name of this scratch directory is available to the job script through the $SCR environment variable.

The cdjob command can be used to change the working directory to the scratch directory of a running batch job. The syntax is

cdjob PBS_JOBID
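
Since $SCR is only defined when the job script follows one of the sample scripts, a job script may want to check the variable before relying on it. A minimal sketch, with a hypothetical function name enter_scratch:

```sh
# enter_scratch: change to the per-job scratch directory named by $SCR.
# Returns non-zero (and leaves the working directory alone) if $SCR is
# unset or does not name an existing directory.
enter_scratch() {
    if [ -n "${SCR:-}" ] && [ -d "$SCR" ]; then
        cd "$SCR"
    else
        echo "SCR is not set to a valid directory" >&2
        return 1
    fi
}
```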

4. Notes

  • To avoid excessive paging, we recommend restricting job memory to 875 MB/core or 7 GB/node.
  • While a job is running, you can ssh to the compute nodes on which your job is running. qstat -n provides the list of hosts assigned to your job. The first host on the list is the launch node.
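
The two figures in the memory recommendation agree with each other: with 8 cores per node, 875 MB per core comes to 7000 MB, or roughly 7 GB, per node. The sketch below applies the per-core figure to a given ppn value; the function name recommended_mem_mb is hypothetical.

```sh
# recommended_mem_mb PPN: memory ceiling in MB for one node when PPN
# cores are in use, at 875 MB per core.
recommended_mem_mb() {
    echo $(( $1 * 875 ))
}

recommended_mem_mb 8   # all 8 cores of a node: prints 7000
```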
