- Introduction
- ProPack 3 vs. ProPack 4
- PBS job memory enforcement
- Checking memory use
- Unaligned Access messages
- Other user impact
1. Introduction
The NCSA SGI Altix system will go into production under SGI ProPack 4 on
Monday, September 25, 2006.
Please note the following important change in ProPack 4:
Memory specification with PBS jobs is enforced. You can expect your jobs
to be terminated if they exceed their requested memory.
See the section PBS job memory enforcement for
details.
As of September 19, new jobs are not being accepted to run under ProPack 3.
This means that no new jobs may be submitted on
co-login1.ncsa.uiuc.edu. We
have begun accepting job submissions to run in the production environment via
co-login2.ncsa.uiuc.edu - these will run starting September 25. They should
be submitted explicitly to the standard queue via the
PBS -q directive.
Until that date, the default queue is set to fuser (for friendly user).
Please submit jobs to this queue to run now. Be aware that at the start of
production, any remaining queued jobs in the fuser queue will be purged.
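As an illustration, a batch script can name its queue with the -q directive. The sketch below is only an example: the resource values are copied from the malloctest job shown later on this page, the script body is a placeholder for your own commands, and /bin/sh is assumed as the job shell. PBS_O_WORKDIR and PBS_QUEUE are environment variables that PBS normally sets inside a job; the defaults used here only matter when the script is run outside PBS.

```shell
#!/bin/sh
#PBS -q standard
#PBS -l ncpus=1,mem=2gb,walltime=00:15:00
#PBS -N malloctest

# Directory the job was submitted from; falls back to "." outside PBS.
workdir="${PBS_O_WORKDIR:-.}"
cd "$workdir" || exit 1

# Queue the job is running in; "none" means the script was run outside PBS.
queue="${PBS_QUEUE:-none}"
echo "running in queue: $queue"

# Placeholder: replace the echo above with the commands for your own job.
```

To run before the production date instead, change the directive to "#PBS -q fuser" or leave it out, since fuser is the default queue until then.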
2. ProPack 3 vs. ProPack 4
SGI ProPack 3 was based on Red Hat Enterprise Linux Advanced Server 3
which uses the Linux 2.4 kernel. ProPack 4 is based on SUSE Linux
Enterprise Server 9 (SLES9) which uses the Linux 2.6 kernel. Here is
a brief summary of the differences:
Component       ProPack 3      ProPack 4
---------       ---------      ---------
Distribution    RedHat         SLES9
Linux Kernel    2.4.21         2.6.5
glibc           2.3.2-95       2.3.3-98
batch system    pbs-5.4.1      pbs-7.1.1
3. PBS job memory enforcement
On ProPack 3, batch jobs could use memory associated with processors
that were assigned to other jobs. Starting with ProPack 4, jobs are
limited to the memory associated with the processors assigned to the
job.
You can expect your jobs to be terminated if they exceed their requested memory. Jobs that use few CPUs and only a couple of gigabytes of memory may occasionally "get away" with using more memory than requested, running to completion one time and being killed the next. If this happens to your job, request more memory as directed by the email you receive from the job kill daemon.
Here is an example of an over-memory job, showing the memory requested and used, the job's standard error file, and the email sent to the user.
% qsub -lncpus=1,mem=2gb,walltime=00:15:00 -N malloctest myscript.pbs
% qhist 5401
Scanning PBS raw accounting records: 09/08/2005 - 08/26/2007
Compute Host: co-compute2:ssinodes=1:ncpus=2:mem=12065792kb
JobId: 5401
JobName: malloctest
User: arnoldg
Project: 0x8e8dca1e0000025b
Queue: standard
Job limits:
wall clock: 00:15:00
Requested CPUs: 1
Available CPUs: 2
Requested Memory: 2097152kb
Queued: 09/14/06 09:02
Started: 09/14/06 09:03
Ended: 09/14/06 09:07
Usage:
wall clock: 00:03:20
cputime: 00:01:03
SUs: 0.02
memory: 2.53G
[arnoldg@co-login1 ~/c]$ cat *.e5401
set_SCR: using existing PBS job directory /scratch/batch/5401
JOB_OVER_MEMORY
This email notification was sent to the user informing them about what happened to the job:
Date: Thu, 14 Sep 2006 09:05:23 -0500
To: arnoldg@ncsa.uiuc.edu
Subject: Job arnoldg5401.co-master Killed
arnoldg :
Your job 5401.co-master was killed from host co-compute2 because it
attempted to use more memory than exists within the processor sets
allocated to your job. Please modify your batch script to request
more memory for your job. For example, to request 10 Gbytes of
memory for the job use the following line in your batch script:
#PBS -l mem=10gb
to request 10 Gbytes of memory (total) for your batch job.
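In the example above, PBS recorded the 2gb request as 2097152kb and qhist reported 2.53G used. A user-side comparison of the two figures can be sketched in shell. The check_mem helper below is hypothetical (it is not a system command), it is not how the job kill daemon makes its decision, and it assumes that the K, M and G suffixes printed by qhist are multiples of 1024.

```shell
# check_mem REQUESTED_KB USED
# Prints "over" if a qhist-style usage figure (for example 2.53G) is larger
# than the requested memory in kilobytes, otherwise prints "within".
check_mem() {
    awk -v req_kb="$1" -v used="$2" 'BEGIN {
        unit = substr(used, length(used))              # last character: K, M or G
        num  = substr(used, 1, length(used) - 1) + 0   # numeric part
        if (unit == "G")      used_kb = num * 1024 * 1024
        else if (unit == "M") used_kb = num * 1024
        else if (unit == "K") used_kb = num
        else { print "unknown unit"; exit 1 }
        result = (used_kb > req_kb + 0) ? "over" : "within"
        print result
    }'
}
```

For the job above, "check_mem 2097152 2.53G" prints "over", which is consistent with the job being killed.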
4. Checking memory use
The cobalt cluster monitor web page, co-monitor.ncsa.uiuc.edu, shows job details, including memory use, for running jobs. Select the Jobs link on the left, click on your RUNNING job in the list, and look at the vmem field under the resources_used section of the page. Here's a sample web display for a job while it was running.
You can also check the memory use of jobs that have already run with the qhist command. The following command shows a table of the jobs you ran in the last 2 days and their memory use:
% qhist -g 2 -u $USER -f jobid,jobname,usedmem
Scanning PBS raw accounting records: 09/11/2006 - 08/26/2007
JobId JobName UsedMem
-------------------------------
5230 test 6.86M
5231 test 9.58M
5340 test 14.31M
5401 malloctest 9.23G
-------------------------------
Total # jobs = 4
Total # SUs = 0.02
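A table like the one above can be filtered to show only the jobs that used a lot of memory. The sketch below assumes the three-column layout shown above (job ID, job name, used memory) and only looks at values reported with a G suffix; big_jobs is a hypothetical helper name, not part of qhist.

```shell
# big_jobs LIMIT_GB
# Reads qhist table output on standard input and prints the rows whose
# UsedMem value is reported in G and is larger than LIMIT_GB.
big_jobs() {
    awk -v limit_gb="$1" '
        # Data rows start with a numeric job ID; header, rule and total lines do not.
        $1 ~ /^[0-9]+$/ && $3 ~ /G$/ {
            gb = substr($3, 1, length($3) - 1) + 0
            if (gb > limit_gb + 0) print $1, $2, $3
        }'
}
```

For example, "qhist -g 2 -u $USER -f jobid,jobname,usedmem | big_jobs 2" applied to the sample table above would print only the line for job 5401.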
An alternate way to check processes currently running in batch is the qps command:
% qps
PID   PPID  COMMAND      HOST            RSS    SIZE S CPU user time system tm
_____ _____ ____________ ___________ _______ _______ _ ___ _________ _________
13640 13587 tcsh         co-login1      3.4M   38.4M S  27 00:00:00  00:00:00
16361 8318  tcsh         co-compute2    4.2M    7.7M S 168 00:00:00  00:00:01
16522 16361 [5548]*1     co-compute2    3.3M    6.2M S 168 00:00:00  00:00:00
16523 16522 malloc8g     co-compute2    4.0G    4.0G R 168 00:00:11  00:00:02
5342  5340  sshd         co-viz8       10.0M   19.2M S   0 00:00:00  00:00:00
5343  5342  tcsh         co-viz8        4.0M    6.4M S   0 00:00:00  00:00:00
8804  13124 watch        co-viz8        2.2M    3.7M S   1 00:00:00  00:00:00
8807  5343  qps          co-viz8       30.3M   34.1M S   0 00:00:02  00:00:00
8917  8807  pminfo       co-viz8        1.7M    3.7M S   4 00:00:00  00:00:00
12715 12700 sshd         co-viz8       10.0M   19.2M S   1 00:00:00  00:00:00
13124 12715 tcsh         co-viz8        4.1M    6.4M S   1 00:00:00  00:00:00
5. Unaligned Access messages
When you run a program on ProPack 4, you may see messages like the
following:
a.out(6337): unaligned access to 0x60000fffffffa81c
The above message indicates that data in the program are not all aligned on
word boundaries (a requirement of the Itanium 2 processor). This does not
cause the program to fail but it can slow the performance of the code. If
you see the message in ProPack 4, the problem also occurred under ProPack 3
although the message was not displayed.
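If a job runs several programs, it can help to see which of them produce the message. The sketch below counts the messages per program name in one or more files, such as a job's standard error file. It assumes the messages have exactly the form shown above and were captured in those files; count_unaligned is a hypothetical helper name.

```shell
# count_unaligned [FILE...]
# Counts "unaligned access" messages per program name. Reads the named
# files, or standard input when no file is given.
count_unaligned() {
    awk '/unaligned access to/ {
        name = $1                 # e.g. "a.out(6337):"
        sub(/\(.*/, "", name)     # drop "(pid):" to leave the program name
        count[name]++
    }
    END { for (n in count) print n, count[n] }' "$@"
}
```

Each output line gives a program name followed by the number of messages found for it.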
The "unaligned access" problem may occur in Fortran programs due to the
order of variables in a Fortran COMMON statement. To prevent the problem,
add the following option to the ifort command:
-align all
See the ifort man page for details.
6. Other user impact