ProPack 3 to ProPack 4 transition notes

  1. Introduction
  2. ProPack 3 vs. ProPack 4
  3. PBS job memory enforcement
  4. Checking memory use
  5. Unaligned Access messages
  6. Other user impact

1. Introduction

The NCSA SGI Altix system will go into production under SGI ProPack 4 on Monday, September 25, 2006.

Please note the following important change in ProPack 4: the memory requested by a PBS job is now enforced. You can expect your jobs to be terminated if they exceed their requested memory. See the section PBS job memory enforcement for details.

As of September 19, new jobs are no longer accepted to run under ProPack 3, so no new jobs may be submitted on co-login1.ncsa.uiuc.edu. Job submissions for the production environment are now accepted on co-login2.ncsa.uiuc.edu; these jobs will start running on September 25. Submit them explicitly to the standard queue with the PBS -q directive.

Until that date, the default queue is set to fuser (for friendly user). Please submit jobs to this queue to run now. Be aware that at the start of production, any remaining queued jobs in the fuser queue will be purged.
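
As a minimal sketch, a batch script for the production environment could name the queue in a #PBS directive instead of on the qsub command line. The resource values and job name are the ones used in the example later in these notes; the use of /bin/sh and the placeholder program name ./myprog are assumptions for illustration only.

```shell
#!/bin/sh
#PBS -q standard
#PBS -l ncpus=1,mem=2gb,walltime=00:15:00
#PBS -N malloctest

# PBS sets PBS_O_WORKDIR to the directory from which qsub was run.
cd "${PBS_O_WORKDIR:-.}" || exit 1

# Replace the line below with the program to run.
# ./myprog is a placeholder name, not a real file on the system.
# ./myprog
echo "job started on $(uname -n)"
```

To run in the friendly-user queue before September 25, the -q line would name fuser instead, or could be left out while fuser is still the default queue.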

2. ProPack 3 vs. ProPack 4

SGI ProPack 3 was based on Red Hat Enterprise Linux Advanced Server 3, which uses the Linux 2.4 kernel. ProPack 4 is based on SUSE Linux Enterprise Server 9 (SLES9), which uses the Linux 2.6 kernel. Here is a brief summary of the differences:

                         ProPack 3       ProPack 4
                         ---------       ---------
      Distribution       RedHat          SLES9
      Linux Kernel       2.4.21          2.6.5
      glibc              2.3.2-95        2.3.3-98
      batch system       pbs-5.4.1       pbs-7.1.1

3. PBS job memory enforcement

On ProPack 3, batch jobs could use memory associated with processors that were assigned to other jobs. Starting with ProPack 4, jobs are limited to the memory associated with the processors assigned to the job. You can expect your jobs to be terminated if they exceed their requested memory. Some jobs that use small CPU counts and only a couple of gigabytes of memory may occasionally "get away" with using more memory, running to completion one time and getting killed the next. If this happens with your job, request more memory as directed by the email you receive from the job kill daemon.

Here is an example of an over-memory job, showing the memory requested and used, the job's standard error file, and the email sent to the user.

% qsub -lncpus=1,mem=2gb,walltime=00:15:00 -N malloctest myscript.pbs

% qhist 5401

Scanning PBS raw accounting records: 09/08/2005 - 08/26/2007

Compute Host:       co-compute2:ssinodes=1:ncpus=2:mem=12065792kb
JobId:              5401
JobName:            malloctest
User:               arnoldg
Project:            0x8e8dca1e0000025b
Queue:              standard

Job limits:
  wall clock:       00:15:00    
  Requested CPUs:   1        
  Available CPUs:   2        
  Requested Memory: 2097152kb 

Queued:             09/14/06 09:02
Started:            09/14/06 09:03
Ended:              09/14/06 09:07

Usage:
  wall clock:       00:03:20    
     cputime:       00:01:03    
         SUs:       0.02        
      memory:         2.53G     

[arnoldg@co-login1 ~/c]$ cat *.e5401
set_SCR: using existing PBS job directory /scratch/batch/5401
JOB_OVER_MEMORY

This email notification was sent to the user informing them about what happened to the job:

Date: Thu, 14 Sep 2006 09:05:23 -0500
To: arnoldg@ncsa.uiuc.edu
Subject: Job arnoldg5401.co-master Killed

arnoldg :
Your job 5401.co-master was killed from host co-compute2 because it 
attempted to use more memory than exists within the processor sets 
allocated to your job.  Please modify your batch script to request 
more memory for your job.  For example, to request 10 Gbytes of 
memory for the job use the following line in your batch script:

      #PBS -l mem=10gb

to request 10 Gbytes of memory (total) for your batch job.
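
The accounting record above reports the 2gb request as 2097152kb, which implies binary multiples (1gb = 1024 * 1024 kb). The following sketch converts a PBS-style memory size to kilobytes so that a request can be compared with reported usage. The function name to_kb is hypothetical, and the code assumes POSIX sh with whole-number sizes ending in kb, mb, or gb.

```shell
#!/bin/sh
# Convert a PBS-style memory size (e.g. 2gb, 512mb, 2097152kb) to kilobytes.
# Assumes binary multiples, consistent with 2gb being reported as 2097152kb.
to_kb() {
    # Lower-case the argument so that 2GB and 2gb are treated alike.
    size=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
    # Strip the two-letter unit suffix to leave the number.
    num=${size%[kmg]b}
    case "$size" in
        *kb) echo "$num" ;;
        *mb) echo $((num * 1024)) ;;
        *gb) echo $((num * 1024 * 1024)) ;;
        *)   echo "unrecognized size: $1" >&2; return 1 ;;
    esac
}

to_kb 2gb     # prints 2097152
to_kb 10gb    # prints 10485760
```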

4. Checking memory use

The cobalt cluster monitor web page (co-monitor.ncsa.uiuc.edu) shows job details, including memory use, for running jobs. Select the Jobs link on the left, click on your RUNNING job in the list, and look at the vmem field under the resources_used section of the page. Here's a sample web display for a job while it was running.

You can also check your memory use for jobs already run with the qhist command. This command will show a table of your jobs run in the last 2 days and their memory use:

% qhist -g 2 -u $USER -f jobid,jobname,usedmem

Scanning PBS raw accounting records: 09/11/2006 - 08/26/2007


  JobId  JobName        UsedMem
-------------------------------
   5230  test             6.86M
   5231  test             9.58M
   5340  test            14.31M
   5401  malloctest       9.23G
-------------------------------
Total # jobs = 4
Total # SUs  = 0.02
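
A table like the one above can be filtered to show only the jobs that used more than a given amount of memory. This is a sketch only: the function name over_limit is hypothetical, it assumes the three-column layout shown in the sample (job id, one-word job name, used memory), and it assumes that the K suffix exists alongside the M and G suffixes seen in the sample.

```shell
#!/bin/sh
# Read a qhist table (jobid, jobname, usedmem) on standard input and print
# the jobs whose UsedMem exceeds the limit, given in gigabytes as $1.
over_limit() {
    awk -v limit_gb="$1" '
        # Data rows start with a numeric job id and have exactly 3 fields;
        # header, separator, and total lines are skipped.
        $1 ~ /^[0-9]+$/ && NF == 3 {
            value = $3 + 0                   # numeric part, e.g. 9.23
            unit = substr($3, length($3))    # last character, e.g. G
            if (unit == "M") value = value / 1024
            else if (unit == "K") value = value / (1024 * 1024)
            if (value > limit_gb) print $1, $2, $3
        }'
}
```

For example, piping the qhist command above into "over_limit 2" would list only job 5401 from the sample table.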

An alternate way to check processes currently running in batch is the qps command:

% qps
  PID  PPID      COMMAND        HOST     RSS    SIZE  S CPU user time system tm
_____ _____ ____________ ___________ _______ _______  _ ___ _________ _________
13640 13587         tcsh   co-login1    3.4M   38.4M  S  27  00:00:00  00:00:00
16361  8318         tcsh co-compute2    4.2M    7.7M  S 168  00:00:00  00:00:01
16522 16361     [5548]*1 co-compute2    3.3M    6.2M  S 168  00:00:00  00:00:00
16523 16522     malloc8g co-compute2    4.0G    4.0G  R 168  00:00:11  00:00:02
 5342  5340         sshd     co-viz8   10.0M   19.2M  S   0  00:00:00  00:00:00
 5343  5342         tcsh     co-viz8    4.0M    6.4M  S   0  00:00:00  00:00:00
 8804 13124        watch     co-viz8    2.2M    3.7M  S   1  00:00:00  00:00:00
 8807  5343          qps     co-viz8   30.3M   34.1M  S   0  00:00:02  00:00:00
 8917  8807       pminfo     co-viz8    1.7M    3.7M  S   4  00:00:00  00:00:00
12715 12700         sshd     co-viz8   10.0M   19.2M  S   1  00:00:00  00:00:00
13124 12715         tcsh     co-viz8    4.1M    6.4M  S   1  00:00:00  00:00:00

5. Unaligned Access messages

When you run a program on ProPack 4, you may see messages like the following:


  a.out(6337): unaligned access to 0x60000fffffffa81c

The above message indicates that data in the program are not all aligned on word boundaries (a requirement of the Itanium 2 processor). This does not cause the program to fail but it can slow the performance of the code. If you see the message in ProPack 4, the problem also occurred under ProPack 3 although the message was not displayed.

The "unaligned access" problem may occur in Fortran programs due to the order of variables in a Fortran COMMON statement. To prevent the problem, add the following option to the ifort command: -align all

See the ifort man page for details.

6. Other user impact

  • In general, programs built on ProPack 3 should run without problems on ProPack 4. Please let us know if you have to recompile your program to get it to work with ProPack 4.

  • HDF/HDF5 (Hierarchical Data Format) libraries: A recompile is recommended.

  • PAPI (Performance Application Programming Interface) library: A recompile may be necessary if you used it separately; otherwise, nothing changes.

  • If you need to use old Linux pthreads with ProPack 4, set the following:
             setenv LD_ASSUME_KERNEL 2.4.21