- Introduction
- Checking memory use
- PBS job memory enforcement
1. Introduction
Since Monday, September 25, 2006, memory specification has been required and enforced for PBS jobs on the NCSA SGI Altix. Jobs are typically terminated if they exceed their requested memory. See the section PBS job memory enforcement for details.
2. Checking memory use
- The cobalt cluster monitor web page, co-monitor.ncsa.uiuc.edu, shows job details, including memory use, for running jobs. Select the Jobs link on the left, click on your RUNNING job in the list, and look at the vmem field under the resources_used section, which appears near the top of the job info display.
- You can also check the memory use of jobs that have already run with the qhist command. The following command shows a table of your jobs run in the last 2 days and their memory use:
% qhist -g 2 -u $USER -f jobid,jobname,usedmem
Scanning PBS raw accounting records: 09/11/2006 - 08/26/2007
JobId JobName UsedMem
-------------------------------
5230 test 6.86M
5231 test 9.58M
5340 test 14.31M
5401 malloctest 9.23G
-------------------------------
Total # jobs = 4
Total # SUs = 0.02
- An alternate way to check processes currently running in batch is the qps command. In this example, the name of the application is malloc8g:
% qps
PID PPID COMMAND HOST RSS SIZE S CPU user time system tm
_____ _____ ____________ ___________ _______ _______ _ ___ _________ _________
13640 13587 tcsh co-login1 3.4M 38.4M S 27 00:00:00 00:00:00
16361 8318 tcsh co-compute2 4.2M 7.7M S 168 00:00:00 00:00:01
16522 16361 [5548]*1 co-compute2 3.3M 6.2M S 168 00:00:00 00:00:00
16523 16522 malloc8g co-compute2 4.0G 4.0G R 168 00:00:11 00:00:02
5342 5340 sshd co-viz8 10.0M 19.2M S 0 00:00:00 00:00:00
5343 5342 tcsh co-viz8 4.0M 6.4M S 0 00:00:00 00:00:00
8804 13124 watch co-viz8 2.2M 3.7M S 1 00:00:00 00:00:00
8807 5343 qps co-viz8 30.3M 34.1M S 0 00:00:02 00:00:00
8917 8807 pminfo co-viz8 1.7M 3.7M S 4 00:00:00 00:00:00
12715 12700 sshd co-viz8 10.0M 19.2M S 1 00:00:00 00:00:00
13124 12715 tcsh co-viz8 4.1M 6.4M S 1 00:00:00 00:00:00
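The qhist and qps listings above are plain whitespace-separated tables, so they are easy to post-process. The following is a minimal sketch in Python, not an NCSA-provided tool: the helper names (to_mb, parse_qhist_table, rss_by_command) are hypothetical, the sample text is copied from the listings above, and the code assumes that the M and G suffixes are binary units (1G = 1024M) and that command names contain no spaces.

```python
# Binary multipliers (1G = 1024M) are an assumption about qhist/qps output.
UNIT_MB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}


def to_mb(value):
    """Convert a size such as '9.23G' or '4.2M' to megabytes."""
    return float(value[:-1]) * UNIT_MB[value[-1]]


def parse_qhist_table(output):
    """Return (jobid, jobname, used_mb) tuples from a qhist job table."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly three fields and start with a numeric job id;
        # headers, rule lines, and totals do not match and are skipped.
        if len(fields) == 3 and fields[0].isdigit():
            rows.append((fields[0], fields[1], to_mb(fields[2])))
    return rows


def rss_by_command(output, command):
    """Sum RSS in MB per host for qps rows whose COMMAND equals `command`."""
    totals = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows: PID PPID COMMAND HOST RSS SIZE S CPU user-time system-time
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        if fields[2] == command:
            host = fields[3]
            totals[host] = totals.get(host, 0.0) + to_mb(fields[4])
    return totals


QHIST_SAMPLE = """\
JobId JobName UsedMem
-------------------------------
5230 test 6.86M
5231 test 9.58M
5340 test 14.31M
5401 malloctest 9.23G
-------------------------------
Total # jobs = 4
"""

QPS_SAMPLE = """\
PID PPID COMMAND HOST RSS SIZE S CPU user time system tm
16361 8318 tcsh co-compute2 4.2M 7.7M S 168 00:00:00 00:00:01
16523 16522 malloc8g co-compute2 4.0G 4.0G R 168 00:00:11 00:00:02
"""

if __name__ == "__main__":
    jobid, name, used_mb = max(parse_qhist_table(QHIST_SAMPLE), key=lambda j: j[2])
    print(f"largest job: {jobid} ({name}), {used_mb:.0f} MB")
    print(rss_by_command(QPS_SAMPLE, "malloc8g"))
```

In practice the sample strings would be replaced by the captured output of the real commands; the parsing rules may need adjusting if the column layout on your system differs from the listings shown here.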
3. PBS job memory enforcement
Jobs are limited to the memory associated with the processors assigned to the job.
You can expect your jobs to be terminated if they exceed their requested memory. Some jobs using small CPU counts and only a couple of gigabytes of memory may occasionally "get away" with using more memory, running one time and getting killed the next. If this happens with your job, request more memory as directed by the email you receive from the job kill daemon.
Here is an example of an over-memory job, showing the memory requested and used, the job's standard error file, and the email sent to the user.
% qsub -lncpus=1,mem=2gb,walltime=00:15:00 -N malloctest myscript.pbs
% qhist 5401
Scanning PBS raw accounting records: 09/08/2005 - 08/26/2007
Compute Host: co-compute2:ssinodes=1:ncpus=2:mem=12065792kb
JobId: 5401
JobName: malloctest
User: arnoldg
Project: 0x8e8dca1e0000025b
Queue: standard
Job limits:
wall clock: 00:15:00
Requested CPUs: 1
Available CPUs: 2
Requested Memory: 2097152kb
Queued: 09/14/06 09:02
Started: 09/14/06 09:03
Ended: 09/14/06 09:07
Usage:
wall clock: 00:03:20
cputime: 00:01:03
SUs: 0.02
memory: 2.53G
[arnoldg@co-login1 ~/c]$ cat *.e5401
set_SCR: using existing PBS job directory /scratch/batch/5401
JOB_OVER_MEMORY
This email notification was sent to the user to explain what happened to the job:
Date: Thu, 14 Sep 2006 09:05:23 -0500
To: arnoldg@ncsa.uiuc.edu
Subject: Job arnoldg5401.co-master Killed
arnoldg :
Your job 5401.co-master was killed from host co-compute2 because it
attempted to use more memory than exists within the processor sets
allocated to your job. Please modify your batch script to request
more memory for your job. For example, to request 10 Gbytes of
memory for the job use the following line in your batch script:
#PBS -l mem=10gb
to request 10 Gbytes of memory (total) for your batch job.
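The arithmetic behind this example can be checked directly: the job requested mem=2gb, which qhist reports as 2097152kb, and it had used 2.53G when it was killed. The sketch below, in Python, converts both values to kilobytes, confirms the job was over its request, and proposes a new mem value. The function names and the 25% headroom are illustrative assumptions, not site policy, and the code assumes the G reported by qhist is a binary unit like the PBS gb. Note that usage recorded at kill time may understate what the job would need to finish.

```python
import math

# PBS size suffixes in kilobytes; 2gb = 2097152kb matches the qhist output above.
PBS_UNITS_KB = {"kb": 1, "mb": 1024, "gb": 1024 ** 2}


def pbs_mem_to_kb(value):
    """Convert a PBS memory string such as '2gb' or '2097152kb' to kilobytes."""
    text = value.strip().lower()
    for suffix, factor in PBS_UNITS_KB.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    raise ValueError(f"unrecognized PBS memory value: {value!r}")


def is_over_request(used_gb, requested):
    """Return True if used memory (in GB) exceeds the PBS memory request."""
    return used_gb * PBS_UNITS_KB["gb"] > pbs_mem_to_kb(requested)


def suggest_mem_directive(used_gb, headroom=0.25):
    """Build a '#PBS -l mem=' line: usage plus headroom, rounded up to whole GB.

    The default 25% headroom is an arbitrary illustrative choice.
    """
    return f"#PBS -l mem={math.ceil(used_gb * (1 + headroom))}gb"


if __name__ == "__main__":
    print(is_over_request(2.53, "2097152kb"))
    print(suggest_mem_directive(2.53))
```

If the resubmitted job is killed again, repeat the check with the newly reported usage rather than relying on the first estimate.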