Running Jobs on NCSA IBM pSeries 690

 
  1. Overview
  2. Running Programs
    1. serial
    2. OpenMP
    3. mpi
  3. Interactive Access
  4. Classes [Queues] and the max. number of running jobs
  5. Disk space for batch jobs
  6. Workload Management commands
    1. llsubmit
    2. llq
    3. llcancel
    4. llsummary
    5. llhist
    6. llhosts
  7. Sample LoadLeveler Scripts
  8. Managing Batch Scripts
  9. Automated Saving of Files from Batch Jobs
  10. Notes
  11. References

1. Overview

The NCSA IBM pSeries 690 uses the IBM LoadLeveler workload management system with the Moab scheduler.

2. Running Programs

2.1 serial

To run a serial program, just enter the executable name and any necessary arguments at the shell prompt:

% ~/c/hello_world
hi from c
hi from fortran

2.2 OpenMP

To run an OpenMP program, set the environment variable OMP_NUM_THREADS to the desired number of threads, then enter the executable name and any necessary arguments at the shell prompt as with a serial program.

% cc_r -qsmp=omp -o test_openmp test.c
% setenv OMP_NUM_THREADS 2
% ./test_openmp
omp_get_dynamic=1
omp_get_num_procs=16
omp_get_max_threads=2
Threads allocated : 2
%

2.3 mpi

MPI programs run under the IBM Parallel Operating Environment (POE). Use the poe command instead of mpirun. List the poe options after the program's own arguments and any file redirections. For example, to run a program on 4 processors:

   poe a.out < myin > myout -procs 4

Run "poe -h" for a short help summary. poe accepts many options and environment variables. When debugging MPI programs, consider the following poe options:

-infolevel 1
poe displays errors and warnings
-labelio yes
output from the parallel tasks is labeled by task ID
-euidevelop yes
the message passing interface performs more detailed checking during execution. This additional checking is intended for developing applications and can significantly slow performance.
Here is an example of the poe command with various debug options:
% poe allall 100 100 2000 -procs 2 -labelio yes -infolevel 1 -euidevelop yes
   0:Node 0 Complete...
   0:Host Cu12 to Host Cu12(1): 1642.407474MB/sec
   1:Node 1 Complete...
   1:Host Cu12 to Host Cu12(0): 1623.271113MB/sec
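
The same debug settings can be supplied through environment variables, which is convenient inside scripts. The sketch below assumes the usual POE naming convention (MP_ followed by the upper-cased flag name) and a Bourne-style shell; under csh, use setenv as in the OpenMP example. The program name a.out is a placeholder.

```shell
#!/bin/sh
# Environment-variable equivalents of:
#   -procs 2 -labelio yes -infolevel 1 -euidevelop yes
# Assumption: POE maps each flag to MP_<FLAGNAME>; confirm with "poe -h".
MP_PROCS=2          # number of MPI tasks
MP_LABELIO=yes      # label each output line with the task ID
MP_INFOLEVEL=1      # display errors and warnings
MP_EUIDEVELOP=yes   # extra MPI checking (slow; for development only)
export MP_PROCS MP_LABELIO MP_INFOLEVEL MP_EUIDEVELOP

# Run only where poe and the placeholder executable actually exist.
if command -v poe >/dev/null 2>&1 && [ -x ./a.out ]; then
    poe ./a.out
fi
```

Flags given on the poe command line are a natural way to override these settings for a single run.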

3. Interactive Access

The machine cu.ncsa.uiuc.edu is available for interactive access. It has 32 processors and 64 gigabytes of memory available to support interactive users. User limits (for all active login sessions) are as follows:

  • a total of 4 processes
  • 1 Gbyte memory per process
  • CPU time of 30 minutes per process
Processes that exceed these limits are terminated. In general, interactive use should be limited to compiling and other development tasks, such as editing source and debugging.

For users with multiple projects, use the newgrp command to change projects for charging purposes when working interactively. See the newgrp manual page for more information [man newgrp].

4. Classes [Queues] and the max. number of running jobs

The following classes are currently available for users:

class        wall_clock_limit    max processors per job    max memory per job
-----------------------------------------------------------------------------
debug        00:30:00            4                         128GB
batch        100:00:00           16                        128GB
cap (*)      600:00:00           16                        128GB
dedicated    100:00:00           32                        256GB

(*) The cap queue is open to all users as of February 2007.

As of March 2006, there are no formal limits on the number of running jobs you may have in queue. A "fair share" policy and queue wait times are evaluated to determine the mix of running jobs.

Note: The NCSA IBM p690 does not support jobs across multiple hosts.
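
As a quick way to read the class table, the following sketch picks a class for a given processor count and run time in minutes. The function name pick_class and the preference order (debug, batch, cap, dedicated) are illustrative assumptions, not NCSA policy, and memory limits are ignored.

```shell
#!/bin/sh
# pick_class CPUS MINUTES
# Prints the first class whose limits in the table cover the request,
# or "none". Limits in minutes: 00:30:00 = 30, 100:00:00 = 6000,
# 600:00:00 = 36000.
pick_class() {
    cpus=$1
    minutes=$2
    if   [ "$cpus" -le 4 ]  && [ "$minutes" -le 30 ];    then echo debug
    elif [ "$cpus" -le 16 ] && [ "$minutes" -le 6000 ];  then echo batch
    elif [ "$cpus" -le 16 ] && [ "$minutes" -le 36000 ]; then echo cap
    elif [ "$cpus" -le 32 ] && [ "$minutes" -le 6000 ];  then echo dedicated
    else echo none
    fi
}

pick_class 2 20       # debug
pick_class 16 1440    # batch
pick_class 32 600     # dedicated
pick_class 64 60      # none
```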

5. Disk space for batch jobs

The IBM p690 system creates a scratch directory for each running batch job. The job directory is created for you when LoadLeveler starts your job and is accessible within the batch script using the $SCR environment variable. See the sample scripts in /usr/local/doc/ll/ for examples that use $SCR.

You can use the interactive system (cu12) to view your data while your job is running. The cdjob command can be used to change the working directory to the scratch directory of a running batch job. The syntax is:

     cdjob jobid 

Your job scratch directory may be deleted soon [possibly immediately] after your job completes, so you should take care to transfer results to the mass storage system (see the section Automated Saving of Files from Batch Jobs).

Please contact the NCSA Consulting Office (consult@ncsa.uiuc.edu) if you plan to use more than 100 Gigabytes of disk space in a single job.

6. Workload Management commands [LoadLeveler / batch]

You could read the entire LoadLeveler book, but that is not as exciting as it sounds to you right now, and you may be eager to get some science done today. Here you'll find the basic commands to help you work with the batch environment.

6.1 llsubmit

This command submits a batch script to the LoadLeveler environment. With LoadLeveler, a batch script may contain multiple job steps, and each step may run independently on different batch machines. The LoadLeveler scripts can be as complex as many programs. A couple of sample batch scripts are provided.

Example usage:
Cu12:~101% llsubmit ll.job
llsubmit: The job "cu12.ncsa.uiuc.edu.1211" has been submitted.

Below are some common LoadLeveler directives you can use in your LoadLeveler scripts. See the IBM LoadLeveler directives documentation for details and other directives available. Note that LoadLeveler does not support specification of directives on the command line.

shell
Specifies the name of the shell to use. If not specified, the shell listed in the owner's password file entry is used. The syntax is:
        #@ shell = name
job_type
Specifies whether the job is serial or parallel. The default is serial. The syntax is:
        #@ job_type = string
For example, to specify an MPI or OpenMP job:
        #@ job_type = parallel
environment
Specifies your initial environment variables in your job. Separate environment specifications with semicolons. An environment specification may be one of the following:
    COPY_ALL specifies that all the environment variables from your shell be copied.
    $var specifies that the environment variable var be copied into the environment of your job when LoadLeveler starts it.
    !var specifies that the environment variable var not be copied into the environment of your job when LoadLeveler starts it. This is most useful in conjunction with COPY_ALL.
    var=value specifies that the environment variable var be set to the value value and copied into the environment of your job when LoadLeveler starts it.
The syntax is:
	#@ environment = env1 ; env2 ; ... 
For example:
	#@ environment = COPY_ALL; !env2;
notification
Specifies when the LoadLeveler system sends mail to you. The syntax is:
	#@ notification = always|error|start|never|complete 
where:
    always notify the user when the job begins, ends, or if it incurs error conditions.
    error notify the user only if the job fails.
    start notify the user only when the job begins.
    never never notify the user.
    complete notify the user only when the job ends. This is the default.

For example, if you want to be notified with mail only when your job step completes, your notification keyword would be:

	#@ notification = complete
class
Specifies the name of a job class (default: batch). The syntax is:
        #@ class = name
For example, to submit jobs to a class called batch, your class keyword would look like the following:
        #@ class = batch
A LoadLeveler class is similar to a queue in other batch systems.

account_no
Specifies the account name string for the job [for charging to projects]. The syntax is:
        #@ account_no = abc

tasks_per_node
Specifies the number of tasks of an MPI parallel program you want to run. For OpenMP, threaded, or serial programs, you can take the default (1) and omit this directive. The value of the tasks_per_node keyword applies only to the job step in which you specify the keyword. (That is, this keyword is not inherited by other job steps.) The syntax is:
        #@ tasks_per_node = number

Where number is the number of tasks or processes you want to run per node. The default is one task per node.

resources
Specifies quantities of the consumable resources "consumed" by each task or process in the job step. For OpenMP or other threaded programs, set ConsumableCpus(N), where N is the number of threads you plan to use (note: you still need to set OMP_NUM_THREADS for OpenMP programs). For OpenMP and serial programs, ConsumableMemory is the total memory your program can use. For MPI programs, ConsumableMemory is the memory each MPI task or process can use. The syntax is:
	#@ resources = name(count) name(count) ... name(count)

Here is an example for an OpenMP program using 4 threads and a total of 1 gigabyte of memory:

	#@ resources = ConsumableCpus(4) ConsumableMemory(1 gb)

This is an example for an MPI program that will require 500 megabytes per process or task:

	#@ resources = ConsumableCpus(1) ConsumableMemory(500 mb)
wall_clock_limit
Sets the limit for the elapsed time for which a job can run. In computing the elapsed time for a job, LoadLeveler considers the start time to be the time the job is dispatched. The default value is 30 minutes (00:30:00). The syntax is:

        #@ wall_clock_limit = limit

An example that requests five minutes (the limit format is [[hours:]minutes:]seconds):

        #@ wall_clock_limit = 5:00
job_name
Specifies the name of the job. The syntax is:
        #@ job_name = job_name
output
Specifies the name of the file to use as standard output (stdout) when your job step runs. If not specified, the file /dev/null is used [the output will be discarded]. The syntax is:
	#@ output = filename 
For example:
 
	#@ output = out.$(jobid) 
error
Specifies the name of the file to use as standard error (stderr) when your job step runs. If you do not specify this keyword, the file /dev/null is used [the standard error stream will be discarded]. The syntax is:
        #@ error = filename

For example:

        #@ error = $(jobid).$(stepid).err
queue
Places one copy of the job step in the queue. This statement is required. The queue statement marks the end of a job step. Note that you can specify statements between queue statements. The syntax is:
	#@ queue
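
The directives above can be combined into one job command file. The following is a minimal sketch for a 4-task MPI job, not one of the NCSA samples: the file name ll_mpi.job, the job name, the account abc, and the path $HOME/myapp/a.out are all placeholders. The file is written with a quoted here-document so that the submitting shell leaves $(jobid), $SCR, and $HOME untouched.

```shell
#!/bin/sh
# Write a job command file for a 4-task MPI run.
# The quoted 'EOF' stops this shell from expanding anything inside;
# $(jobid) and $(stepid) are LoadLeveler variables, while $SCR and $HOME
# are expanded by the job's own shell at run time.
cat > ll_mpi.job <<'EOF'
#!/bin/sh
#@ job_name = mpi_example
#@ job_type = parallel
#@ class = batch
#@ account_no = abc
#@ tasks_per_node = 4
#@ resources = ConsumableCpus(1) ConsumableMemory(500 mb)
#@ wall_clock_limit = 00:30:00
#@ notification = error
#@ output = $(jobid).$(stepid).out
#@ error = $(jobid).$(stepid).err
#@ queue

cd "$SCR"                       # per-job scratch directory
cp "$HOME/myapp/a.out" .        # hypothetical location of the executable
poe ./a.out -labelio yes
EOF

echo "Wrote ll_mpi.job; submit it with: llsubmit ll_mpi.job"
```

Setting output and error explicitly matters here, because both default to /dev/null.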

6.2 llq

To view the current queue of job steps, run llq:
% llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
cu12.6.0                 arnoldg    12/23 14:52 R  50  batch        cu10
cu12.7.0                 arnoldg    12/23 14:52 ST 50  batch        cu08

2 job step(s) in queue, 0 waiting, 1 pending, 1 running, 0 held, 0 preempted

The values of the ST (state) field can be:

       C   Completed
      CA   Canceled
      CK   Checkpointing
      CP   Complete Pending
       D   Deferred
       E   Preempted
      EP   Preempt Pending
       H   User Hold
      HS   User Hold and System Hold
       I   Idle
      MP   Resume Pending
      NR   Not Run
      NQ   Not Queued
       P   Pending
       R   Running
      RM   Removed
      RP   Remove Pending
       S   System Hold
      ST   Starting
      SX   Submission Error
      TX   Terminated
       V   Vacated
      VP   Vacate Pending
       X   Rejected
      XP   Reject Pending
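
To summarize a long listing, the state column can be tallied with awk. This is a sketch that assumes the default llq column layout shown in the sample at the start of this section; count_states is a hypothetical helper name.

```shell
#!/bin/sh
# count_states: read llq output on stdin and print "STATE COUNT" per state.
# Job-step rows are recognised by an Id of the form host.job.step. With the
# default columns the state is the 5th field, because the date and time
# under "Submitted" count as two fields.
count_states() {
    awk '$1 ~ /^[A-Za-z0-9]+\.[0-9]+\.[0-9]+$/ { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Demonstration on the sample listing from this section.
count_states <<'EOF'
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
cu12.6.0                 arnoldg    12/23 14:52 R  50  batch        cu10
cu12.7.0                 arnoldg    12/23 14:52 ST 50  batch        cu08

2 job step(s) in queue, 0 waiting, 1 pending, 1 running, 0 held, 0 preempted
EOF
```

On the system itself the same function would be used as "llq | count_states".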
Here's a list of some helpful flags you can use with llq:
-x
Provides extended information about a selected job.
-s
Provides information on why a selected list of jobs remain in the Hold, NotQueued, Idle or Deferred state. Example:
     % llq -s 1169
     ...
     ==================== EVALUATIONS FOR JOB STEP cu12.ncsa.uiuc.edu.1169.0 ====

     The class of this job step is "batch".
     Total number of available initiators of this class on all machines in 
     the cluster: 0
     Minimum number of initiators of this class required by job step: 16
     The number of available initiators of this class is not sufficient for 
     this job step.
-l
Specifies that a long listing be generated for each job for which status is requested.
-w
Provides AIX WLM CPU and real memory statistics for running jobs only. This option only accepts a single hostname and a single step id when used in conjunction with the -h flag. The following statistics are displayed for every node the job is running on:
  • Current CPU resource consumption as a percentage of the total resources available
  • Total CPU time consumed in milliseconds
  • Current real memory consumption as a percentage of the total resources available
  • The highest number of resident memory pages used
Example:
      Cu12:% llq -w
      =============== Job Step cu12.ncsa.uiuc.edu.1691.0 ===============
      cu06.ncsa.uiuc.edu:
                Resource: CPU
                        snapshot: 100
                        total: 152681218
                Resource: Real Memory
                        snapshot: 10
                        high water: 5990157
-u userlist
Is a blank-delimited list of users. When used with -h option, only the user's jobs monitored on the machines in the hostlist are queried. When used alone, only the user's jobs monitored on the schedd host are queried.
-h hostlist
Is a blank-delimited list of machines. If the -s flag is not specified, all jobs monitored on the machines in the hostlist are queried. If the -s flag is specified, the list of machines is considered when determining why a job remains in Idle state. When used with -u option, the userlist is used to further select jobs for querying.
-c classlist
Is a blank-delimited list of classes. If -s option is specified, this option is ignored. When used with -h option, only the classes specified on the machines in the hostlist are queried. When used alone, only the classes specified on the schedd host are queried.

6.3 llcancel

To cancel a job, use llcancel. The syntax is:

llcancel JobID

Example:
% llcancel 1713
llcancel: Cancel command has been sent to the central manager.
You can cancel all of your jobs quickly with the -u flag:
 -u userlist

     Is a blank-delimited list of users. When used with
     the -h option, only the user's jobs monitored on the
     machines in the hostlist are canceled. When used alone, only
     the user's jobs monitored by the machine issuing the command
     are canceled.

6.4 llsummary

llsummary will show information about jobs that have completed.
Cu12:~207% llsummary -l -j cu12.2411 | head -19
================== Job cu12.ncsa.uiuc.edu 2411 ==================
             Job Id: cu12.ncsa.uiuc.edu 2411
           Job Name: cu12.ncsa.uiuc.edu.2411
  Structure Version: 210
              Owner: arnoldg
         Unix Group: aau
    Submitting Host: cu12.ncsa.uiuc.edu
  Submitting Userid: 25114
 Submitting Groupid: 1023
    Number of Steps: 3
------------------ Step cu12.ncsa.uiuc.edu.2411.0 ------------------
        Job Step Id: cu12.ncsa.uiuc.edu.2411.0
          Step Name: 0
         Queue Date: Thu Jan 30 11:03:44 CST 2003
         Dependency:
             Status: Removed
      Dispatch Time: Thu Jan 30 11:03:55 CST 2003
         Start Time: Thu Jan 30 11:03:55 CST 2003
    Completion Date: Thu Jan 30 11:04:40 CST 2003

6.5 llhist

The llhist command provides resource usage information for currently running and completed LoadLeveler batch jobs. The syntax is llhist JOBID.

Example:

Cu12:~102% llhist 17420
--------------------------------------------------------
        IBM pSeries 690 Batch Job Summary
--------------------------------------------------------
  Job Id           :     17420
  Job Name         :     mumps1M_symm_32
  User             :     skoric
 
 
--------------------------------------------------------
                STEP 0
--------------------------------------------------------
  Job Status       :     Completed ...
 
  Submitted        :     Tue Apr  1 21:27:52 CST 2003
  Started          :     Tue Apr  1 21:49:38 CST 2003
  Finished         :     Tue Apr  1 22:05:17 CST 2003
  Host             :     cu01.ncsa.uiuc.edu
  Project          :     acr
  Class            :     dedicated
 
Usage:
 
  Cpu Time         :     07:38:24  [hh:mm:ss]
  Run Time         :     00:15:39
  Peak Task Memory :     3.73 GB
  Service Units    :     8.35
 
Limits:
 
  Wall Clock Limit :     00:45:00  [hh:mm:ss]
  Number of CPU's  :     32
  Memory           :     125.00 GB 

See the man page for details of the output.
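
The usage figures in the sample can be cross-checked with a little arithmetic. The sketch below computes how much of the reserved CPU time was used. It also reproduces the Service Units value on the assumption that it equals run time in hours multiplied by the number of CPUs; the charging formula is not stated on this page, although the sample's 8.35 is consistent with it.

```shell
#!/bin/sh
# hms_to_seconds HH:MM:SS -> seconds
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

cpu=$(hms_to_seconds 07:38:24)    # Cpu Time from the sample
run=$(hms_to_seconds 00:15:39)    # Run Time from the sample
ncpu=32                           # Number of CPU's from the sample

# Share of the reserved CPU time (run time x CPUs) that was actually used.
awk -v c="$cpu" -v r="$run" -v n="$ncpu" \
    'BEGIN { printf "CPU utilization: %.1f%%\n", 100 * c / (r * n) }'

# Assumption: Service Units = run time in hours x CPUs.
awk -v r="$run" -v n="$ncpu" \
    'BEGIN { printf "Service units:   %.2f\n", r / 3600 * n }'
```

For the sample job this prints a utilization of 91.5% and 8.35 service units.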

6.6 llhosts

The llhosts command provides machine utilization information. The syntax is llhosts.

Example:

Cu12:% llhosts
 host  jobs  Gb_free  Startd      00     25      50     75    100  % load
_________________________________________________________________________
 cu01     0    226    Idle       |
 cu02     3    217    Running    |------------------------------
 cu03     8    182    Running    |-------------------------------*
 cu04     3     47    Running    |-------------------------------***
 cu05    13     49    Running    |-----------------------
 cu06     0    219    Idle       |
 cu07     3     46    Running    |----------------
 cu08     0      6    Idle       |
 cu09     7     34    Running    |-----------------------
 cu10     5     53    Running    |--------------------------
 cu11     1     35    Running    |
 cu12     0     18    Idle       |----------

See the man page for details of the output.
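
The listing is easy to filter with awk. As a sketch, the hypothetical helper idle_hosts below prints the hosts whose Startd column reads Idle, ordered by the Gb_free column, assuming the column layout of the sample above.

```shell
#!/bin/sh
# idle_hosts: read llhosts output on stdin and print "HOST GB_FREE" for
# hosts whose Startd state is Idle, largest Gb_free value first.
idle_hosts() {
    awk '$1 ~ /^cu[0-9]+$/ && $4 == "Idle" { print $1, $3 }' | sort -k2,2nr
}

# Demonstration on a few rows from the sample.
idle_hosts <<'EOF'
 host  jobs  Gb_free  Startd      00     25      50     75    100  % load
 cu01     0    226    Idle       |
 cu04     3     47    Running    |-------------------------------***
 cu06     0    219    Idle       |
 cu08     0      6    Idle       |
 cu12     0     18    Idle       |----------
EOF
```

On the system itself this would be run as "llhosts | idle_hosts".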

7. Sample LoadLeveler Scripts

See the samples in /usr/local/doc/ll/. You can copy one that's similar to what you want to do and customize it for your requirements. Some advanced LoadLeveler scripts are also being developed.

The sample batch scripts use UniTree for permanent storage of files. They assume that the executable and any input files are already on UniTree. If that's not true in your case or if you have problems with UniTree within batch jobs, see this FAQ.
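
For orientation only, here is a minimal sketch of an OpenMP job command file. It is not one of the samples in /usr/local/doc/ll/ and does not use UniTree; the file name ll_openmp.job, the job name, and the path $HOME/myapp/test_openmp are placeholders. The here-document is left unquoted so that one shell variable sets both ConsumableCpus and OMP_NUM_THREADS, which keeps the two values in step; everything that must reach the job file unexpanded is escaped with a backslash.

```shell
#!/bin/sh
# Write a job command file for a 4-thread OpenMP run with 1 GB of memory.
threads=4

# Unquoted EOF: $threads is expanded now; \$(jobid), \$SCR and \$HOME are
# written literally and resolved later by LoadLeveler or the job's shell.
cat > ll_openmp.job <<EOF
#!/bin/sh
#@ job_name = openmp_example
#@ job_type = parallel
#@ class = batch
#@ resources = ConsumableCpus($threads) ConsumableMemory(1 gb)
#@ environment = OMP_NUM_THREADS=$threads
#@ wall_clock_limit = 00:30:00
#@ output = \$(jobid).out
#@ error = \$(jobid).err
#@ queue

cd "\$SCR"
cp "\$HOME/myapp/test_openmp" .    # hypothetical location of the executable
./test_openmp
# Copy results out of \$SCR before the job ends; scratch may be removed.
EOF

echo "Wrote ll_openmp.job; submit it with: llsubmit ll_openmp.job"
```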

8. Managing Batch Scripts

There is a program named find_batch_scripts that will help you locate batch scripts on the system [should you forget their location].

9. Automated Saving of Files from Batch Jobs

The saveafterjob utility on the NCSA IBM p690 is available for automated, guaranteed saving of output files from batch jobs to the mass storage system. For details on its use, see the saveafterjob page and the sample LoadLeveler scripts.

10. Notes

The standard output and standard error [from "output =" and "error =" in your LoadLeveler scripts] will be placed in the directory you were in at the time you submitted the script with llsubmit. If you're working in a shared filesystem [NFS mounts or GPFS filesystems], you can watch the output spool in real time by using the tail command:

% tail -f cu12.33.0.out
   0:
   0:Running on 16 PEs
   0:
   0:sampling from 2^0 to 2^24  bytes
   0:
   0:Effective Bandwidth: 3760.54 [MB/sec]
   0:
   0:
   0:***********************************************************
   0:
If you submit a job from a local filesystem that is not shared [local scratch, /tmp, or /var/tmp], your standard output and error files will be in the same directory on the execution machine for that job step. If that's the case, take care to copy them to mass storage or a shared filesystem at the end of the job step; otherwise they'll be stranded on the execution machine and you will not be able to see them.

11. References

Submitting and managing LoadLeveler jobs

"Diagnosis and Messages Guide"

IBM Parallel Environment for AIX (PE)

Parallel Environment diagnostic and error messages, and the POE Hitchhiker's guide.