- Overview
- Running MPI Programs
- Queues
- Disk Space for Batch Jobs
- LSF batch Commands
- bsub
- bsub for Interactive Jobs
- bjobs
- bhist
- bkill
- bacct
- bpeek
- Sample Batch Script
- Managing Batch Scripts
- LSF Documentation
1. Overview
Tungsten runs the job manager LSF (Load Sharing Facility) batch, a load-sharing batch system
from Platform Computing.
See the lsfintro man page for a description of LSF, and the lsfbatch man page for a list
of the batch commands available in LSF batch.
The access nodes are restricted to compiling tasks and have a runtime limit of 30 minutes. Processes that exceed the limits may be terminated. Use
interactive job submissions for debugging.
2. Running MPI Programs
The Xeon Cluster uses ChaMPIon/Pro for running MPI programs.
Instead of using mpirun to run MPI programs, use the cmpirun command.
For example, to run the program testMPI on 4 processors:
cmpirun -np 4 -lsf testMPI < myin > myout
See "cmpirun -h" for the short help.
There are many environment variables and options that can be used with cmpirun.
Some of the most frequently used are:
- -gdb: Starts the job under the gdb debugger.
- -lsf: Needed to work with machines allocated by LSF.
- -mpi_debug: Enables extra checking in the ChaMPIon/Pro library.
- -mpi_verbose: Enables verbose output from the ChaMPIon/Pro library.
- -np n: Specifies the number of processors n to run on.
- -poll: Turns on polling mode for Myrinet jobs. This is especially useful for jobs with large numbers of small messages.
- -scale_level 2: Set to 2 when running with more than 1000 processors.
- -timeout seconds: Sets the timeout period during startup.
- -tv: Starts the job under the TotalView debugger.
- -verbose: Displays errors and warnings.
Also see Debugging in the ChaMPIon/Pro environment for cmpirun debugging options.
3. Queues
The following queues are currently available for users:
| Queue | Walltime | Max # Nodes |
| debug | 30 mins | 8 |
| normal (default) | 48 hours | 512(*) |
| long | 100 hours | 512(*) |
(*) We recommend that you limit the number of nodes per job to 512.
4. Disk Space for Batch Jobs
The system creates a scratch directory for each running batch
job. The directory is created when LSF starts your job
and is accessible within the batch script through the $SCR
environment variable.
See the sample batch script for how to use $SCR in a batch job.
The cdjob command can be used to change the
working directory to the scratch directory of a running batch job. The syntax is:
cdjob jobid
Your job scratch directory may be deleted soon
after your job completes, so transfer your results to the
mass storage system at the end of your job script.
5. LSF batch Commands
A complete list of LSF batch commands can be found in the
man page
for lsfbatch.
Below are brief descriptions of the more useful commands.
For more detailed information, refer to the individual man pages.
5.1 bsub
The bsub command is used to submit a batch job to a queue.
- All options to bsub can be specified either on the command line or as a line in a script (known as an embedded option). If embedded options are used, the script must be submitted using the following format:
bsub < script_name
where script_name is the name of the script and the < is required. Scripts submitted this way are spooled, meaning the system saves a copy of the script. Hence, changing the script file after the job is submitted does not affect execution.
To execute a script in C shell, use the following as the first line of your script:
#!/bin/csh
- To use embedded bsub options in batch scripts, begin each line containing options with #BSUB (leave at least one blank space between #BSUB and the start of the first option).
- The main bsub options are listed below.
The sample batch script illustrates
bsub usage and options.
Also see the bsub man page for other options.
- -n proc: specifies the number of processes (default = 1). This is the maximum number
of active processes at any given time during the lifetime of the job.
If different numbers of processors are used over the lifetime of
the job, you must specify the maximum number used.
- -W run_time_limit: specifies the total job wall clock time (default = 30 mins). The syntax
is [hour:]minute.
- -R "span[ptile=X]": specifies that the job should use X processors per node, where X is 1 or 2 (default = 2).
- -o out_file: stores the standard output/error of the job in the file out_file.
- -J job_name: specifies a job name.
- -N: sends mail at the end of a job.
- -P psn: charges your job to a specific project (PSN).
- -q queuename: submits your job to the queuename queue.
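As a minimal sketch of embedded options (this is not the NCSA sample script from /usr/local/doc/lsf), the shell commands below write a small C shell job script and then submit it with the required < redirection. The file name myjob.csh, the job name, the output file name, and the program and data file names are placeholders chosen for illustration, and the option values are only examples. Staging of files into and out of the scratch directory is left out.

```shell
# Write a job script with embedded #BSUB options.
# The quoted 'EOF' keeps $SCR unexpanded until the job itself runs.
cat > myjob.csh <<'EOF'
#!/bin/csh
# Number of processes
#BSUB -n 4
# Wall clock limit in [hour:]minute form
#BSUB -W 1:00
# Two processors per node
#BSUB -R "span[ptile=2]"
# Queue, job name, and output file (placeholder names)
#BSUB -q normal
#BSUB -J testMPI_run
#BSUB -o testMPI_run.out

# Work in the job scratch directory. Copying the executable and
# input files into $SCR, and saving results afterwards, is omitted.
cd $SCR
cmpirun -np 4 -lsf testMPI < myin > myout
EOF

# Submit only where LSF is installed; "<" is required for embedded options.
if command -v bsub >/dev/null 2>&1; then
    bsub < myjob.csh
fi
```

Because the script is spooled at submission, later changes to myjob.csh do not affect a job that is already queued.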
5.1.1 bsub for Interactive Jobs
The -Is option tells bsub you want to run an interactive job. You can also
use other bsub options such as those documented in the
sample batch script.
For example, the following command:
bsub -Is -n4 -W 1:00 tcsh
will run an interactive job on 4 processors using tcsh with a wallclock limit of 1 hour.
After you enter the command, you will have to wait for LSF to start the
job. As with any job, your interactive job will wait in the queue until
the specified number of nodes is available. If you specify a small
number of nodes, the wait will be shorter.
When you are done with your runs, you can use the exit command to end
the job.
You will be charged for the wall clock time used by all requested nodes until you end the job.
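The arithmetic behind the -W and -n values in the example above can be sketched with two small shell helpers. The function names walltime_minutes and nodes_needed are made up for this illustration and are not LSF commands. The node count assumes that LSF places the ptile number of processors on each node, as described for the -R option.

```shell
# Convert a -W value in [hour:]minute form to minutes.
# awk treats "08" as decimal 8, so leading zeros are safe.
walltime_minutes() {
    echo "$1" | awk -F: '{ print ((NF == 2) ? $1 * 60 + $2 : $1 + 0) }'
}

# Number of nodes needed for a job.
# $1 = number of processes (-n), $2 = processors per node (ptile).
# Integer division rounded up.
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Values from the interactive example: -n4 -W 1:00, default ptile of 2.
walltime_minutes 1:00
nodes_needed 4 2
```

For the interactive example these print 60 (minutes) and 2 (nodes), so the job would hold two nodes for up to one hour.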
5.2 bjobs
The bjobs command displays the status of jobs.
Enter bjobs to find the status of your jobs. To limit the output to a particular job, specify the jobid
on the command line. To find the status of all jobs on the system use the -u all option.
For example, the following command returns information on all jobs currently in the queue:
% bjobs -u all
JOBID  USER   STAT  QUEUE   FROM  EXEC  JOB_NAME     NDS   WALL   ELAP
67513  jdoe   RUN   normal  tuna  tuna  isajob2       32  11:00  11:04
67518  smith  RUN   normal  tuna  tuna  deltatest     48  12:00   6:24
67519  brown  RUN   normal  tuna  tuna  testjob       12  12:00   3:40
67570  jdoe   RUN   normal  tunb  tuna  32run         32   6:00   2:46
67529  plum   RUN   normal  tuna  tunb  bigset        16  12:00   2:37
67572  black  RUN   normal  tuna  tuna  interactive    2   6:00   2:37
67846  jdoe   RUN   normal  tuna  tunb  2short        32   3:00   1:21
56534  brown  PEND  normal  tuna                     256   2:00
58901  white  PEND  normal  tunb        runit        256  12:00
58931  white  PEND  normal  tunb        bench        256   0:30
67517  jdoe   PEND  normal  tuna        200run       200  12:00
Popular bjobs options:
- -r: prints information only about running jobs
- -l: prints more detailed information; can be used with a jobid or -u all
The following command prints detailed information on job 67513: bjobs -l 67513
- -q: prints information only about jobs in a particular queue
The following command prints information about all jobs in the normal queue: bjobs -u all -q normal
For a full list of bjobs options, see the bjobs man page.
On Tungsten, bjobs is actually an NCSA wrapper for the real LSF bjobs command.
It was created to eliminate the listing of the nodes in a running job and to
display some new columns:
- NDS: the number of nodes requested
- WALL: the wall clock limit
- ELAP: the length of time that the job has been running (format HH:MM)
- EXEC: has been changed to indicate which subcluster the job is running on instead
of displaying the full list of compute nodes in the job
Users can still run the real LSF bjobs command by specifying ${LSF_BINDIR}/bjobs.
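A listing like the one shown earlier in this section can be summarized with a short awk filter. The sketch below counts jobs per state. The function name count_by_state is hypothetical, and the filter assumes only that the first three columns are JOBID, USER, and STAT, as in the listing.

```shell
# Count jobs per state from bjobs output read on standard input.
# Only lines whose first field is a numeric job ID are counted,
# so the header line is skipped.
count_by_state() {
    awk '$1 ~ /^[0-9]+$/ { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Typical use on the system:
#   bjobs -u all | count_by_state

# Self-contained check using a few lines from the sample listing.
count_by_state <<'EOF'
JOBID USER STAT QUEUE FROM EXEC JOB_NAME NDS WALL ELAP
67513 jdoe RUN normal tuna tuna isajob2 32 11:00 11:04
67529 plum RUN normal tuna tunb bigset 16 12:00 2:37
56534 brown PEND normal tuna 256 2:00
EOF
```

The filter reads only the first three fields because pending jobs leave later columns such as EXEC blank, which shifts whitespace-separated field positions.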
5.3 bhist
The bhist command displays the history of batch jobs in the LSF batch system.
See the man page for more information. For older jobs, make sure to use
the -n option to specify the number of event log files that
bhist searches. The default is 1, i.e., only the current event log file.
For example:
bhist -n4 -l jobid gives detailed information on a particular job that ran in the last few days.
bhist -n4 -l -a -u userid gives detailed information on all jobs in the last few days for a particular user.
5.4 bkill
The bkill command deletes a queued job or kills a running job. Obtain the jobid using the bjobs command. Using the sample session shown above, user plum deletes his batch job by entering:
% bkill 67529
Job deleted.
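When several queued jobs need to be removed, the job IDs can be pulled out of the bjobs listing first. The sketch below selects one user's pending jobs. The function name pending_ids is hypothetical, and the filter assumes that the first three columns of the bjobs output are JOBID, USER, and STAT.

```shell
# Print the IDs of one user's pending jobs from bjobs output on stdin.
# $1 = user name.
pending_ids() {
    awk -v user="$1" '$2 == user && $3 == "PEND" { print $1 }'
}

# On the system, review the list first and then kill each job:
#   bjobs | pending_ids "$USER"
#   for id in $(bjobs | pending_ids "$USER"); do bkill "$id"; done

# Self-contained check using lines from the sample session above.
pending_ids white <<'EOF'
JOBID USER STAT QUEUE FROM EXEC JOB_NAME NDS WALL ELAP
67529 plum RUN normal tuna tunb bigset 16 12:00 2:37
58901 white PEND normal tunb runit 256 12:00
58931 white PEND normal tunb bench 256 0:30
67517 jdoe PEND normal tuna 200run 200 12:00
EOF
```

For the sample lines this prints 58901 and 58931, one per line. The loop form calls bkill once per ID, so nothing is run when the list is empty.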
5.5 bacct
The bacct command displays accounting information that LSF batch keeps on completed batch jobs.
- bacct -l jobid: gives detailed information on a particular job (use the bhist command to find your jobid)
- bacct -b -u userid: gives a summary of information on all jobs for a particular user
- bacct -l -C 2004/04/20,2004/04/22 -u userid: gives detailed information on all jobs for a particular user completed between the days specified
NOTE: NCSA system accounting used to compute CPU usage is done
separately from that of LSF batch, so accounting information returned by
bacct should be treated as approximate.
5.6 bpeek
The bpeek command displays the
stdout and stderr output of an unfinished batch job in the LSF batch system
up to the time that the command is invoked.
It is useful for monitoring the progress of a job and identifying errors.
Users can only invoke bpeek on their own jobs. Enter bpeek jobid to see
the output of a particular job.
6. Sample Batch Script
A sample LSF batch script for a ChaMPIon/Pro MPI job is available in
/usr/local/doc/lsf; you can copy it and modify it
as needed for your own use.
The sample batch script uses scratch space for batch jobs ($SCR).
It also uses UniTree
for permanent storage of files. It assumes that the executable and any
input files are already on UniTree. If that is not true in your case,
or if you have problems with UniTree within batch jobs, see this FAQ.
7. Managing Batch Scripts
A program named find_batch_scripts will help you locate batch scripts
on the system, should you forget their location.
8. LSF Documentation (PDF)
Note: These documents are only available to NCSA HPC users and require an NCSA login.