Memory Placement on the Origin2000
- Introduction
- Data Placement
- Environment Variables to control Data Placement
Introduction
Since the Origin2000 is a NUMA (Non-Uniform Memory Access) system,
the time it takes for a CPU to access a memory location varies according to
the location of the memory relative to the CPU. Therefore, a program performs
best when its data is in the memory closest to the CPU that is executing the
code.
This document explains the basics of data placement on the Origin and
describes a change in the default data placement policy on NCSA's Origin
systems for shared memory programs (programs compiled with -pfa or -mp)
starting April 5, 1999.
The bulk of the material in this document was taken from SGI's
Origin2000 and Onyx2 Performance Tuning and Optimization Guide, which is
recommended reading for the details. Chapters 1 and 2 cover the Origin2000
architecture and memory management, and Chapter 8 has details on tuning for
parallel processing.
Data Placement
There are three different data placement policies on the Origin:
- First-Touch: Under this policy, the process that first touches (that is,
  writes to or reads from) a page of memory causes that page to be allocated
  on the node on which the process is running.
- Round-Robin: Under this policy, data are allocated in a round-robin fashion
  from all the nodes the program runs on.
- Fixed: Under this policy, pages of memory are placed in specific locations
  by the programmer using compiler placement directives.
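The first two policies can be illustrated with a small model. The sketch below is a toy simulation in Python, not SGI's implementation: the function name place_pages, the zero-based node numbering, and the representation of a program as a list of (page, touching node) events are all assumptions made for illustration. The Fixed policy is not modeled.

```python
def place_pages(touches, policy, num_nodes):
    """Return {page: node} for a sequence of (page, touching_node) events.

    Toy model only: a real kernel also considers free memory, page
    migration, and topology, none of which is modeled here.
    """
    placement = {}
    next_node = 0  # round-robin cursor
    for page, touching_node in touches:
        if page in placement:
            continue  # a page is placed once, on its first touch
        if policy == "FIRST_TOUCH":
            placement[page] = touching_node
        elif policy == "ROUND_ROBIN":
            placement[page] = next_node
            next_node = (next_node + 1) % num_nodes
        else:
            raise ValueError("unknown policy: " + policy)
    return placement


# Six pages, all touched first by a process on node 0 (serial initialization).
serial_touches = [(page, 0) for page in range(6)]
first_touch = place_pages(serial_touches, "FIRST_TOUCH", num_nodes=3)
round_robin = place_pages(serial_touches, "ROUND_ROBIN", num_nodes=3)
```

In this model the serially touched pages all land on node 0 under first-touch, while round-robin spreads them evenly over the three nodes.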
The initial placement of data is important for consistently achieving high
performance on parallelized applications. For serial applications whose
memory requirements fit on the node on which they run, this is not an issue.
Large-memory serial applications, however, have to use memory from other
nodes. The NCSA Origin systems have either 512 MB or 1 GB of memory per node
(each node has 2 processors), so codes requiring more memory than this should
be tuned to run in parallel for best performance.
See the NCSA Origin2000 Technical Summary for the memory available on each
machine.
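As a rough sizing aid, the arithmetic can be sketched as follows. The helper name nodes_needed and the array size are hypothetical, the array is assumed to hold 4-byte reals, and the sketch assumes an entire node's memory is available to the program, which overstates what a real job gets because the operating system also uses memory.

```python
import math

MB = 1024 * 1024


def nodes_needed(total_bytes, node_memory_bytes):
    """Minimum number of nodes whose combined memory holds total_bytes."""
    return math.ceil(total_bytes / node_memory_bytes)


# A hypothetical single-precision array x(nn,nn,nn) with nn = 640.
nn = 640
array_bytes = 4 * nn ** 3  # 4 bytes per element

nodes_512 = nodes_needed(array_bytes, 512 * MB)    # 512 MB per node
nodes_1024 = nodes_needed(array_bytes, 1024 * MB)  # 1 GB per node
```

Here the array occupies 1000 MB, so it spans two nodes on a machine with 512 MB per node and fits within a single node's worth of memory on a machine with 1 GB per node.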
First-touch is currently the default policy for all programming models. It
works well with single-threaded programs, because it keeps memory close to
the program's one process. It also works well for programs that have been
parallelized completely, so that each parallel thread allocates and
initializes the memory it uses.
For example, this is just what you
want for message-passing programs that run on the Origin. In such programs
each process has its own separate data space. Except for messages sent
between the processes, all processes use memory that should be local. Each
process initializes its own data, so memory is allocated from the node the
process is running on, thus making the accesses local.
However, for programs where initialization is done in serial and subsequent
computations are done in parallel, first-touch does not work well, especially
when scaling to a large number of processors, because all of the memory will
be allocated on a single node. With round-robin, even if the data are
initialized sequentially, the memory holding them will not be allocated from
a single node; it will be evenly spread out among all the nodes running the
program.
A simple example program illustrates this:
      program test
c... nn is not set in the original; 100 is an illustrative value
      integer nn
      parameter (nn = 100)
      dimension x(nn,nn,nn)
c... Initialization (array x is first touched here)
      do k = 1,nn
        do j = 1,nn
          do i = 1,nn
            x(i,j,k) = 0.
          enddo
        enddo
      enddo
c... Main calculation loop - perhaps an iterative or time marching loop
      do n = 1,10000
!$omp parallel
        call compute(x,nn)
!$omp end parallel
      enddo
      stop
      end
As in the above example, it is common for users to parallelize the compute
subroutine because that is the compute-intensive part of their code.
However, when the user does not parallelize the initialization loop, with the
first-touch data placement policy all the program's memory will be allocated
on the node running the main thread. If compute is parallelized and,
presumably, the work in compute involves accesses to elements of x, then when
the user is running on M processors, all but the processors on the node that
holds x (up to M-1 of them) must fetch every element of x from remote memory.
Remote memory accesses are slower than local memory accesses, and hence the
user will pay a penalty. This penalty can be avoided by doing the
initialization in parallel:
!$omp parallel do
      do k = 1,nn
        do j = 1,nn
          do i = 1,nn
            x(i,j,k) = 0.
          enddo
        enddo
      enddo
Another benefit of initializing in parallel is a shorter runtime for the
initialization itself, especially for large programs or when scaling to a
large number of processors.
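The effect of parallelizing the initialization can be estimated with a simple count. The sketch below is a model in Python rather than a measurement: it assumes one thread per node (the Origin actually has two processors per node), a static block partition of the outer k loop that is identical in the initialization and in compute, and that each k-plane of x sits on pages owned by a single node. The function names are hypothetical.

```python
def block_owner(k, nn, num_threads):
    """Thread that handles iteration k (0-based) under a static block split."""
    chunk = -(-nn // num_threads)  # ceiling division
    return k // chunk


def remote_fraction(nn, num_threads, parallel_init):
    """Fraction of k-planes that a compute thread must fetch remotely."""
    remote = 0
    for k in range(nn):
        # Node that first touched plane k: its block owner if the
        # initialization was parallel, otherwise the main thread (0).
        home = block_owner(k, nn, num_threads) if parallel_init else 0
        # Node that works on plane k in the parallel compute phase.
        worker = block_owner(k, nn, num_threads)
        if home != worker:
            remote += 1
    return remote / nn


serial = remote_fraction(nn=64, num_threads=8, parallel_init=False)
parallel = remote_fraction(nn=64, num_threads=8, parallel_init=True)
```

Under these assumptions, 7 of every 8 planes are remote after a serial initialization and none are remote after a parallel one.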
Also see the Bandwidth Plot for a comparison of the performance of the
different placement policies on the double-precision vector operation
a(i) = b(i) + q*c(i).
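For reference, the vector operation can be written out as below. This is only a Python sketch of the kernel and of one common way to turn a timing into a bandwidth figure (24 bytes moved per element: two 8-byte loads and one 8-byte store). How the Bandwidth Plot itself was measured is not stated here, so the byte accounting and the function names should be treated as assumptions.

```python
import time


def triad(b, c, q):
    """Return a where a[i] = b[i] + q*c[i]."""
    return [bi + q * ci for bi, ci in zip(b, c)]


def bandwidth_mb_per_s(n, seconds):
    """Assumed accounting: 3 arrays x 8 bytes per element, in MB/s."""
    return 24.0 * n / seconds / 1.0e6


n = 100000
b = [1.0] * n
c = [2.0] * n
q = 3.0

start = time.perf_counter()
a = triad(b, c, q)
elapsed = time.perf_counter() - start

# Guard against a zero reading from a coarse timer.
rate = bandwidth_mb_per_s(n, elapsed) if elapsed > 0 else float("inf")
```

Interpreted Python is far too slow for the resulting rate to say anything about Origin memory bandwidth; the sketch only shows what quantity such a plot compares.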
Sometimes it may not be possible to parallelize the initialization loop. In
these cases, the round-robin data placement policy can be used to spread the
data across the nodes that the user's program is running on. This has an
effect similar to that of the first-touch policy with a parallelized
initialization.
Note that in real applications, the actual memory access patterns will be
complicated. These patterns are dictated by the numerical algorithm used and
the programming style adopted. In addition to the first-touch and round-robin
policies, SGI provides various directives to fine-tune data placement for
optimal performance. These are described in detail in the section
Using Data Distribution Directives of SGI's
Origin2000 and Onyx2 Performance Tuning and Optimization Guide.
First-touch vs. Round-robin for parallel jobs
The following example illustrates the difference between first-touch and
round-robin memory allocation.
Assume we are running on 4 processors and have an array declared as a(8) that
is initialized in parallel. Treating each element as if it occupied its own
page, the elements would be allocated on the processors in the following way:
PROC #    ROUND_ROBIN    FIRST_TOUCH
  1       A(1),A(5)      A(1),A(2)
  2       A(2),A(6)      A(3),A(4)
  3       A(3),A(7)      A(5),A(6)
  4       A(4),A(8)      A(7),A(8)
Note that if you do not allocate data in parallel and instead have a serial
loop 'touch' data first, then all elements of A would get allocated on PROC 1.
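The table can be reproduced with a few lines of Python. This is an illustrative model, not system code: it treats each array element as its own page, numbers processors from 1, assumes the first-touch column comes from a parallel initialization with a static block split, and uses hypothetical function names.

```python
def round_robin_layout(n, procs):
    """Element i (1-based) goes to processor ((i - 1) mod procs) + 1."""
    layout = {p: [] for p in range(1, procs + 1)}
    for i in range(1, n + 1):
        layout[(i - 1) % procs + 1].append(i)
    return layout


def first_touch_layout(n, procs, parallel_init=True):
    """Element i goes to whichever processor initializes it."""
    layout = {p: [] for p in range(1, procs + 1)}
    chunk = -(-n // procs)  # ceiling division: block size per processor
    for i in range(1, n + 1):
        owner = (i - 1) // chunk + 1 if parallel_init else 1
        layout[owner].append(i)
    return layout


rr = round_robin_layout(8, 4)
ft = first_touch_layout(8, 4)
ft_serial = first_touch_layout(8, 4, parallel_init=False)

# Print one row per processor: round-robin elements, then first-touch elements.
for p in range(1, 5):
    print(p, rr[p], ft[p])
```

Calling first_touch_layout with parallel_init=False models the serial case described above, where every element ends up on processor 1.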
Environment Variables to control Data Placement
There is a simple way to control memory allocation for shared memory programs
(programs compiled with -mp or -pfa) by the use of the environment variable
_DSM_PLACEMENT. From man pe_environ:
_DSM_PLACEMENT   Allocates memory for all stack, data, and text
                 segments. This environment variable accepts the
                 following values:

                 Value         Action
                 FIRST_TOUCH   Specifies first-touch data placement.
                               Currently the default.
                 ROUND_ROBIN   Specifies round-robin data allocation.
Starting April 5, 1999, the default data placement for shared memory programs
will be changed to round-robin. Users who have programmed their codes to
optimally make use of first-touch data placement can continue to use it with
the above environment variable.
In the C-shell, the command is:
setenv _DSM_PLACEMENT FIRST_TOUCH
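If jobs are launched from a driver script rather than typed at the shell, the same setting can be passed through the child's environment. The following Python sketch is an assumption about workflow, not part of NCSA's instructions. The child command here is a stand-in that just echoes the variable; a real script would run the user's own executable instead.

```python
import os
import subprocess
import sys

# Copy the current environment and request first-touch placement.
env = dict(os.environ)
env["_DSM_PLACEMENT"] = "FIRST_TOUCH"

# Stand-in child process: prints the value it inherited.
# Replace this command with the real program, e.g. ["./a.out"] (hypothetical).
child = [sys.executable, "-c",
         "import os; print(os.environ['_DSM_PLACEMENT'])"]
result = subprocess.run(child, env=env, stdout=subprocess.PIPE,
                        universal_newlines=True, check=True)
seen_by_child = result.stdout.strip()
```

Setting the variable only in the child's environment leaves the launching shell unchanged, so different runs can use different policies side by side.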
Note: For MPI and SHMEM programs, the default policy will continue to be
first-touch.