Memory Placement on the Origin2000

  1. Introduction
  2. Data Placement
  3. Environment Variables to control Data Placement

1. Introduction

Since the Origin2000 is a NUMA (Non-Uniform Memory Access) system, the time it takes for a CPU to access a memory location varies according to the location of the memory relative to the CPU. Therefore, a program performs best when its data is in the memory closest to the CPU that is executing the code. This document explains the basics of data placement on the Origin and describes a change in the default data placement policy on NCSA's Origin systems for shared memory programs (programs compiled with -pfa or -mp), effective April 5, 1999.

The bulk of the material in this document was taken from SGI's Origin2000 and Onyx2 Performance Tuning and Optimization Guide, which is recommended reading for the details. Chapters 1 and 2 cover the Origin2000 architecture and memory management, and Chapter 8 covers tuning for parallel processing.

2. Data Placement

There are three different data placement policies on the Origin:

First-Touch: Under this policy, the process that first touches (that is, writes to, or reads from) a page of memory causes that page to be allocated in the node on which the process is running.

Round-Robin: Under this policy, data are allocated in a round-robin fashion from all the nodes the program runs on.

Fixed: Under this policy, pages of memory are placed in specific locations by the programmer using compiler placement directives.
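
The first two policies can be modeled in a few lines. The following Python sketch is only a toy model of the rules as stated above, not code that runs on or queries an Origin; the names first_touch and round_robin, the process-to-node map, and the touch sequence are all hypothetical. The Fixed policy is not modeled because it depends on programmer-supplied directives.

```python
def first_touch(touch_order, node_of_process):
    """Map each page to the node of the process that touches it first.

    touch_order: iterable of (process, page) pairs in time order.
    node_of_process: dict mapping process id -> node id.
    """
    placement = {}
    for process, page in touch_order:
        # Only the first touch of a page decides where it lives.
        placement.setdefault(page, node_of_process[process])
    return placement


def round_robin(pages, nodes):
    """Deal pages out to the job's nodes in turn, ignoring who touches them."""
    return {page: nodes[i % len(nodes)] for i, page in enumerate(pages)}


# Hypothetical job: 4 processes on 2 nodes (2 processors per node), 4 pages.
node_of_process = {0: 0, 1: 0, 2: 1, 3: 1}
# Process 0 touches every page first; process 2 touches page 3 later.
touches = [(0, 0), (0, 1), (0, 2), (0, 3), (2, 3)]

ft = first_touch(touches, node_of_process)
rr = round_robin(range(4), nodes=[0, 1])
print("first-touch:", ft)
print("round-robin:", rr)
```

In this model, first-touch puts every page on node 0 because process 0 touched each page first, while round-robin alternates pages between the two nodes regardless of who touches them.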

The initial placement of data is important for consistently achieving high performance in parallelized applications. For a serial application whose memory requirements fit on the node on which it runs, this is not an issue. Large-memory serial applications, however, must use memory from other nodes. The NCSA Origin systems have either 512 MB or 1 GB of memory per node (each node has 2 processors), so codes requiring more memory than this should be tuned to run in parallel for best performance. See the NCSA Origin2000 Technical Summary for the memory available on each machine.
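
As a rough illustration of the arithmetic behind this advice, the Python sketch below estimates lower bounds for a serial job. It assumes the whole of a node's memory is available to the job (in practice less is), and the function names are hypothetical.

```python
import math


def min_nodes_spanned(job_mb, node_mb):
    """Fewest nodes whose combined memory can hold the job."""
    return math.ceil(job_mb / node_mb)


def min_remote_fraction(job_mb, node_mb):
    """Smallest possible share of the job's memory that is off-node.

    At most node_mb of the job's memory can be local to its one CPU.
    """
    return max(0.0, 1.0 - node_mb / job_mb)


# A hypothetical 1536 MB serial job on 512 MB nodes.
print(min_nodes_spanned(1536, 512))
print(round(min_remote_fraction(1536, 512), 3))
```

Under these assumptions, a job that fits on one node has no remote memory, while a job three times the node size keeps at least two thirds of its memory on other nodes.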

First-touch is currently the default policy for all programming models. It works well with single-threaded programs, because it keeps memory close to the program's one process. It also works well for programs that have been parallelized completely, so that each parallel thread allocates and initializes the memory it uses. For example, this is just what you want for message-passing programs that run on the Origin. In such programs each process has its own separate data space. Except for messages sent between the processes, all processes use memory that should be local. Each process initializes its own data, so memory is allocated from the node the process is running in, thus making the accesses local.

However, for programs where initialization is done serially and subsequent computations are done in parallel, first-touch does not work well, especially when scaling to a large number of processors, because all of the memory will be allocated on a single node. With round-robin, even if the data are initialized sequentially, the memory holding them will not be allocated from a single node; it will be spread evenly among all the nodes running the program.

A simple example program illustrates this:

      program test
c... nn is an arbitrary illustrative size
      parameter (nn = 64)
      dimension x(nn,nn,nn)

c... Initialization (array x is first touched here)

      do k = 1,nn
         do j = 1,nn
            do i = 1,nn
               x(i,j,k) = 0.
            enddo
         enddo
      enddo

c... Main calculation loop - perhaps an iterative or time marching loop

      do n = 1,10000
!$omp parallel
         call compute(X,NN)
!$omp end parallel
      enddo

      stop
      end

As in the above example, it is common for users to parallelize the compute subroutine because it is the compute-intensive part of their code. However, when the user does not parallelize the initialization loop, the first-touch data placement policy allocates all of the program's memory on the node running the main thread. If compute is parallelized and the work in compute involves accesses to elements of x, then when the user is running on M processors, every thread except those on the node holding the data (at most two, since each node has 2 processors) must fetch the elements of x it uses from remote memory, and all of those requests are served by a single node. Remote memory accesses are slower than local memory accesses, and hence the user will pay a penalty. This penalty can be avoided by doing the initialization in parallel:
!$omp parallel do
      do k = 1,nn
         do j = 1,nn
            do i = 1,nn
               x(i,j,k) = 0.
            enddo
         enddo
      enddo
Another benefit of initializing in parallel is that the initialization itself runs faster, especially for large programs or when scaling to a large number of processors.
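
The size of the penalty can be estimated with a toy model. The Python sketch below is not a measurement and does not run on the Origin; it assumes first-touch placement, 2 threads per node, a contiguous block partition of the k index over threads in both the initialization and compute phases, and that each k-plane of x occupies whole pages. The function name remote_fraction is hypothetical.

```python
def remote_fraction(n_planes, n_threads, parallel_init, threads_per_node=2):
    """Share of k-planes whose compute-phase accesses go to a remote node."""

    def owner(plane):
        # Contiguous block partition of planes over threads (assumed schedule).
        return plane * n_threads // n_planes

    def node_of(thread):
        return thread // threads_per_node

    remote = 0
    for plane in range(n_planes):
        worker = owner(plane)
        # First-touch: a plane lives on the node of whoever initialized it.
        # Serial initialization means thread 0 touched everything.
        home = node_of(worker) if parallel_init else node_of(0)
        if node_of(worker) != home:
            remote += 1
    return remote / n_planes


# Hypothetical run: 64 planes, 8 threads on 4 nodes.
print(remote_fraction(64, 8, parallel_init=False))
print(remote_fraction(64, 8, parallel_init=True))
```

With serial initialization, only the two threads sharing the main thread's node work on local data, so three quarters of the planes are accessed remotely in this model; with parallel initialization every plane is local to the thread that uses it.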

Also see the Bandwidth Plot for a comparison of the performance of different placement policies for the double-precision vector operation a(i) = b(i) + q*c(i).

Sometimes it may not be possible to parallelize the initialization loop. In these cases, the round-robin data placement policy can be used to spread the data across the nodes the program is running on. This has an effect similar to that of the first-touch policy with a parallelized initialization.

Note that in real applications, the actual memory access patterns will be complicated. These patterns are dictated by the numerical algorithm used and the programming style adopted. In addition to the first-touch and round-robin policies, SGI provides various directives to fine-tune data placement for optimal performance. These are described in detail in the section Using Data Distribution Directives of SGI's Origin2000 and Onyx2 Performance Tuning and Optimization Guide.

First-touch vs. Round-robin for parallel jobs

The following example illustrates the difference between first-touch and round-robin memory allocation.

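Assume we are running on 4 processors and have an array declared as dimension a(8). The elements would then be allocated across the processors as follows (for first-touch, assuming each processor initializes its own contiguous block of the array):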

PROC #         ROUND_ROBIN      FIRST_TOUCH
1              A(1),A(5)        A(1),A(2)
2              A(2),A(6)        A(3),A(4)
3              A(3),A(7)        A(5),A(6)
4              A(4),A(8)        A(7),A(8)
Note that if you do not allocate data in parallel and instead have a serial loop 'touch' the data first, then under first-touch all elements of A would be allocated on PROC 1.
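
The table can be reproduced with a short Python sketch. It follows the table's simplification of placing individual elements (real placement is per page of memory) and assumes that, under first-touch, each processor initializes one contiguous block. The function name place_elements is hypothetical.

```python
def place_elements(n_elements, n_procs, policy):
    """Return {processor: [element names]} for a 1-based array A."""
    table = {proc: [] for proc in range(1, n_procs + 1)}
    # Block size for a parallel initialization, rounded up.
    chunk = -(-n_elements // n_procs)
    for i in range(1, n_elements + 1):
        if policy == "ROUND_ROBIN":
            # Elements are dealt out to processors in turn.
            proc = (i - 1) % n_procs + 1
        elif policy == "FIRST_TOUCH":
            # Each processor touches its own contiguous block first.
            proc = (i - 1) // chunk + 1
        else:
            raise ValueError("unknown policy: " + policy)
        table[proc].append("A(%d)" % i)
    return table


rr = place_elements(8, 4, "ROUND_ROBIN")
ft = place_elements(8, 4, "FIRST_TOUCH")
for proc in range(1, 5):
    print(proc, ",".join(rr[proc]), ",".join(ft[proc]))
```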

3. Environment Variables to control Data Placement

There is a simple way to control memory allocation for shared memory programs (programs compiled with -mp or -pfa) by the use of the environment variable _DSM_PLACEMENT. From man pe_environ:
   _DSM_PLACEMENT Allocates memory for all stack, data, and text
                    segments.  This environment variable accepts the
                    following values:

                    Value          Action

                    FIRST_TOUCH    Specifies first-touch data placement.
                                   Currently the default.

                    ROUND_ROBIN    Specifies round-robin data allocation.
Starting April 5, 1999, the default data placement for shared memory programs will be changed to round-robin. Users who have programmed their codes to optimally make use of first-touch data placement can continue to use it with the above environment variable. In the C-shell, the command is:
setenv _DSM_PLACEMENT FIRST_TOUCH
Note: For MPI and SHMEM programs, the default policy will continue to be first-touch.
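
For jobs started from a wrapper script rather than an interactive shell, the same setting can be passed through the child process environment. The Python sketch below is one possible way to do that; the helper placement_env and the program name ./a.out are hypothetical, and only the two values listed in the man page excerpt above are accepted.

```python
import os

# The two values documented for _DSM_PLACEMENT in the excerpt above.
VALID_POLICIES = ("FIRST_TOUCH", "ROUND_ROBIN")


def placement_env(policy, base_env=None):
    """Return a copy of the environment with _DSM_PLACEMENT set."""
    if policy not in VALID_POLICIES:
        raise ValueError("unsupported placement policy: %r" % (policy,))
    env = dict(os.environ if base_env is None else base_env)
    env["_DSM_PLACEMENT"] = policy
    return env


env = placement_env("FIRST_TOUCH", base_env={})
print(env)

# Hypothetical launch of a shared memory program with this setting:
#   import subprocess
#   subprocess.run(["./a.out"], env=placement_env("FIRST_TOUCH"))
```

Validating the value before launch catches misspellings that the shell would otherwise pass through silently.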