NCSA Home
Contact Us | Intranet | Search

ncsa

Understanding Performance on the SGI Origin

The following sections are excerpted from the complete set of
online documentation pages furnished by SGI:

1 Origin2000 Architecture

To extract top performance from a computer system, it is important to understand the architecture of the system.This section describes the architecture of the Origin2000 system and the components that it is made up of, including the R10000 microprocessor.

1.1 Introduction: Scalable Shared Memory

The Origin2000 is a scalable shared memory multiprocessor (S2MP). In this section, we describe what this phrase means and why the Origin2000 uses this new architecture. Since there is always a limit to the performance of a single processor, computer manufacturers have increased the performance of their systems by incorporating multiple processors. Two approaches for utilizing multiple processors have emerged: shared memory implementations and distributed memory implementations.

Previous shared memory computer designs --- including Silicon Graphics's Power Series, Challenge and Power Challenge systems --- utilized a bus-based architecture. In this architecture, processor and memory boards plug into a common bus; communication between the hardware components occurs on the bus.

This architecture has many desirable properties. Each processor has direct access to all the memory in the system. This allows efficient multitasking and multiuser use of the system through symmetric multiprocessing. In addition, parallel programs are relatively easy to implement on these types of systems. Parallelism may be achieved by simply inserting directives into the code to distribute the iterations of loops among the processors. An example of this is shown below:

c$doacross local(j,k,i), shared(n,a,r,s), mp_schedtype=simple
      do j = 1, n
         do k = 1, n
            do i = 1, n
               a(i,j) = a(i,j) + r(i,k)*s(k,j)
            enddo
         enddo
      enddo

Here, the simple schedule type means that blocks of consecutive iterations are assigned to each processor. That is, if there are p processors, the first processor is responsible for executing the first ceil(n/p) iterations, the second processor executes the next ceil(n/p) iterations, and so on. Loops without parallelization directives run sequentially.

One advantage of this shared memory parallelization style is that it can be done incrementally. That is, the user can identify the most time-consuming loop in the program and parallelize just it. The program can then be run to validate correctness and measure performance. If a sufficient speedup has been achieved, the parallelization is complete. If not, additional loops can be parallelized one at a time.

Another advantage of the shared memory parallelization is that loops can be parallelized without being concerned about which processors have accessed data elements in other loops. In the parallelized loop above, the first processor is responsible for updating the first ceil(n/p) columns of the array a. In the loop below, an interleave schedule type is used, which means that the first processor will access every p-th column of a (different schedule types are used to distribute work more evenly among the processors):

c$doacross local(kb,k,t), shared(n,a,b), mp_schedtype=interleave
      do kb = 1, n
         k = n + 1 - kb
         b(k) = b(k)/a(k,k)
         t = -b(k)
         call daxpy(k-1,t,a(1,k),1,b(1),1)
      enddo

Since data are equally accessible by all processors, nothing special needs to be done for a processor to access data that were last touched by another processor; each processor simply references the data it requires. This makes parallelization easier since the programmer only needs to make sure the processors have an equal amount of work to do; it doesn't matter which subset of the work a particular processor does.

1.1.1 Distributed Memory Message Passing

Let's now contrast the shared memory approach with the distributed memory approach. In a distributed memory computer, there is no common bus to which memory attaches. Each processor has its own local memory, which only it can access directly. For a processor to have access to the local memory of another processor, a copy of the desired data elements must be sent from one processor to the other. Typically, this data communication, called message-passing, is accomplished using a software library. MPI and PVM are two commonly used message-passing libraries.

To run a program on a message-passing machine, the programmer must decide how the data are to be distributed among the local memories. All data structures that are to be operated on in parallel must be distributed. In this approach, incremental parallelism is not possible. The key loops in a program will typically reference many data structures; each must be distributed in order to run that loop in parallel. Once a data structure has been distributed, all sections of code that reference it must be run in parallel. The result is an all-or-nothing approach to parallelism on distributed memory machines. In addition, note that, since many programs's data structures are bigger than a single local memory, it may not be possible to run the program on the machine without first parallelizing it.

In the shared memory section above, we saw that different loops of a program may be parallelized in different ways to ensure an even allocation of work among the processors. On a distributed memory machine, this translates into data structures needing to be distributed differently in different sections of the program. Redistributing the data structures is the responsibility of the programmer, who must find efficient ways to accomplish the needed data shuffling. Such issues never arise in the shared memory implementation.

The bottom line is that programming is much easier on a shared memory computer.

1.3.1 Four-way Superscalar Architecture

The R10000 is the latest microprocessor from MIPS Technologies. It is designed to solve many of the performance bottlenecks common in existing microprocessor implementations. In the Origin2000, the R10000 runs at either 195 MHz or 180 MHz. The R10000 is a four-way superscalar RISC processor. It can fetch and decode four instructions per cycle to be run on its five independent, pipelined execution units: a nonblocking load store unit, two 64-bit integer ALUs, a 32-/64-bit pipelined floating point adder, and a 32-/64-bit pipelined floating point multiplier. The integer ALUs are asymmetric. Although both perform add, subtract, and logical operations, ALU 1 handles shifts, conditional branch and conditional move instructions, and ALU 2 is responsible for integer multiplies and divides. Similarly, instructions are partitioned between the floating point units. The floating point adder is responsible for add, subtract, absolute value, negate, round, truncate, ceiling, floor, conversions, and compare operations, and the floating point multiplier carries out multiplication, division, reciprocal, square root, reciprocal square root, and conditional move instructions; the two units can be chained together to perform multiply-add and multiply-subtract operations.

1.3.2 MIPS IV Instruction Set Architecture

The R10000 implements the MIPS IV instruction set architecture (ISA). This is a superset of the previous MIPS I, II, and III ISAs, so programs compiled for those ISAs are binary-compatible with the R10000. The MIPS IV ISA is also used in the R8000, so R8000 programs are also binary-compatible with the R10000.

In addition to the 32- and 64-bit instructions contained in the previous MIPS ISAs, MIPS IV also includes floating point multiply-add (madd), reciprocal, and reciprocal square root instructions, indexed loads and stores, prefetch instructions, and conditional moves. Although MIPS I, II, and III load and store operations allow addresses to be constructed as a compile-time constant offset from a base register, indexed loads and stores allow the offset to be a run-time value contained in an integer register; these instructions allow you to reduce the number of address increments inside loops. Prefetch instructions allow you to request data to be moved into the cache well before it is needed, thus eliminating much of the latency of a cache miss. Conditional move instructions can be used to replace branches inside loops, thus allowing tight, superscalar code to be generated for those loops. The MIPS IV ISA specifies the availiblility of 32 integer and 32 floating point registers.

1.3.3 Cache Architecture

The R10000 has a two-level cache hierarchy. Located on the microprocessor chip are a 32 KB, two-way set associative level-1 instruction cache and a 32 KB, two-way set associative, two-way interleaved level-1 (L1) data cache. Off-chip is a two-way set associative, unified (instructions and data) level-2 (L2) cache. This secondary cache may range in size from 512 KB to 16 MB; the size of the secondary cache in the Origin2000 is 4 MB for 195 MHz systems, and 1 MB for 180 MHz systems. The L1 instruction cache uses a line size of 64 B, while the L1 data cache has a line size of 32 B. The line size of the L2 cache may be either 64 B or 128 B; in the Origin2000 it is 128 B. Both the L1 data cache and the L2 unified cache employ a least recently used (LRU) replacement policy for selecting in which set of the cache to place a new cache line. The secondary cache may be run at a handful of speeds, ranging from the same speed as the processor down to one-third of that frequency. In Origin2000 the secondary cache operates at two-thirds of the processor frequency. The R10000 is nearly unique in allowing for a set associative off-chip cache. To provide a cost-effective implementation, however, only enough wires are provided to check for a cache hit in one set of the secondary cache at a time. To allow for two-way functionality, an 8,192-entry way prediction table is used to record which set of a particular cache address was most recently used. This set is checked first to determine whether there is a cache hit. This takes one cycle and is performed concurrently with the transfer of the first half of the set's cache line. The other set is checked on the next cycle while the second half of the cache line for the first set is being transferred. If there is a hit in the first set, no extra cache access time is incurred. If the hit occurs in the second set, its data must be read from the cache and a minimum four-cycle mispredict penalty is incurred.

3.3.1 Profiling Tools

The hardware counters in the R10000 CPU make it possible to profile the behavior of a program in many ways without modifying the code. The software tools are perfex, which runs a program and reports exact counts of any two selected events from the R10000 counters. Alternatively, it time-multiplexes all 32 countable events and reports extrapolated totals of each. Perfex is useful for identifying what problem (for example, secondary data cache misses) is hurting the performance of your program the most. (see timex for simple timing functionality.) The Perfex functions are also available as callable library functions in libperfex. Similarly, for speedshop, the ssapi library is available. Speedshop (actually, the ssrun command), which runs a program while sampling the state of the program counter and stack, and writing the sample data to a file for later analysis. You select the timebase for the sampling and the particular type of information to be sampled. SpeedShop is useful for locating where in your program the performance problems occur. Prof, which analyzes a Speedshop data file and displays it in a variety of formats. Dprof, which, like Speedshop, samples a program while it is executing but records memory access information as a histogram file. It identifies which data structures in the program are involved in performance problems. Use these tools to find out what constrains the program and which parts of it consume the most time. Through the use of a combination of these tools, it is possible to identify most performance problems.

3.3.2 Hardware Counter Registers

The profiling tools depend for most of their features on the R10000's performance counter registers. These on-chip registers can be programmed to count hardware events as they happen, for example, machine cycles, instructions, branch predictions, floating point instructions, or cache misses. There are only two performance counter registers. Each can be programmed to count machine cycles or 1 of 15 other events, for a total of 32 events that can be counted (30 of which are distinct). The specific events are summarized in the following table, which can be obtained by using the command perfex -h.


 0 Cycles
                                          16 Cycles
 1 Instructions issued to functional units
                                          17 Instructions graduated
 2 Memory data access (load, prefetch, sync,
        cacheop) issued
                                          18 Memory data loads graduated
 3 Memory stores issued
                                          19 Memory data stores graduated
 4 Store conditionals issued
                                          20 Store conditionals graduated
 5 Store conditionals failed
                                          21 Floating point instructions graduated
 6 Branches decoded
                                          22 Quadwords written back from L1 cache
 7 Quadwords written back from L2 cache
                                          23 TLB refill exceptions
 8 Correctable ECC errors on L2 cache
                                          24 Branches mispredicted
 9 L1 cache misses (instruction)
                                          25 L1 cache misses (data)
 10 L2 cache misses (instruction)
                                          26 L2 cache misses (data)
 11 L2 cache way mispredicted (instruction)
                                          27 L2 cache way mispredicted (data)
 12 External intervention requests
                                          28 External intervention request hits in L2 cache
 13 External invalidate requests
                                          29 External invalidate request hits in L2 cache
 14 Instructions done (formerly, virtual coherency
        condition)
                                          30 Stores, or prefetches with store hint, to CleanExclusive
                                                 L2 cache blocks
 15 Instructions graduated
                                          31 Stores, or prefetches with store hint, to Shared L2 cache
                                                 blocks.

The counters are 64-bit integers. When a counter overflows, a hardware trap occurs. The kernel can preload a counter with 2^64 - n to cause a trap after n counts occur. The profiling tools use this capability. For example, the command ssrun -gi_hwc programs the graduated instruction counter (event 17) to overflow every 32 K counts. Each time the counter overflows, ssrun samples the program counter and stack state of the subject program. The reference page r10k_counters(5) gives detailed information on how the counters can be accessed through the /proc interface. This is the interface used by the profiling tools. The interface hides the division of events between only two registers and allows the software to view the counters as a single set of thirty-two 64-bit counters. The operating system time-multiplexes the active counters between the events being counted. This requres sampling and scaling, which introduce some error when more than two events are counted. In general, it is better to access the counters through the profiling tools. A program that uses the counter interface directly cannot be profiled using perfex or using ssrun for any experiment that depends on counters. When a program must access counter values directly, the simplest interface is through libperfex, documented in the libperfex reference page.

3.3.3 How to Do Performance Analysis Using Perfex

The simplest profiling tool is perfex, documented in the perfex reference page. It runs a subject program and records data about the run, similar to timex:

% perfex [options] command [arguments]

The subject program and its arguments are given. perfex sets up the counter interface and forks the subject program. When the program ends, perfex writes counter data to standard output. perfex gathers its information with no modifications to your existing program. Although this is convenient, the data obtained come from the entire run of the program. If you only want to profile a particular section of the program, you need to use the library interface to perfex, libperfex(3). To use this interface, insert a call to initiate counting into your program's source code and another to terminate it; a third call prints the counts gathered. The program must then be linked with the libperfex library:

% f77|cc -o program -lperfex

Since you can use SpeedShop to see where in a program various event counts come from, libperfex is not described in detail. More information can be found in its reference page.

3.3.3.1 Absolute Counts of One or Two Events

Use perfex options to specify what is to be counted. You can specify one or two countable events. In this case, the counts are absolute and accurate. For example, the command

% perfex -e 15 -e 21 adi2

runs the subject program and reports the exact counts of graduated instructions and graduated floating point instructions. You use this mode to explore specific points of program behavior.

3.3.3.2 Statistical Counts of All Events

When you specify option -a (all events), perfex multiplexes all 32 events over the program run. Each count is active 1/16 of the time and then scaled by 16. The resulting counts have some statistical error. The error is small (and the counts sufficiently repeatable) provided that the subject program runs in a stable execution mode for a number of seconds. When the program runs for a short time, or shifts between radically different regimes of instruction or data use, the counts are less dependable and less repeatable. Nevertheless, perfex -a usually gives a good overview of program operation. Here is the perfex command line and output applied to a sample program called adi2:

% perfex -a -x adi2

WARNING: Multiplexing events to project totals--inaccuracy possible.

 Time:      7.990 seconds
 Checksum:  5.6160428338E+06
 0 Cycles......................................................   1645481936
 1 Issued instructions.........................................    677976352
 2 Issued loads................................................    111412576
 3 Issued stores...............................................     45085648
 4 Issued store conditionals...................................            0
 5 Failed store conditionals...................................            0
 6 Decoded branches............................................     52196528
 7 Quadwords written back from scache..........................     61794304
 8 Correctable scache data array ECC errors....................            0
 9 Primary instruction cache misses............................         8560
10 Secondary instruction cache misses..........................          304
11 Instruction misprediction from scache way prediction table..          272
12 External interventions......................................         6144
13 External invalidations......................................        10032
14 Virtual coherency conditions................................            0
15 Graduated instructions......................................    371427616
16 Cycles......................................................   1645481936
17 Graduated instructions......................................    400535904
18 Graduated loads.............................................     90474112
19 Graduated stores............................................     34776112
20 Graduated store conditionals................................            0
21 Graduated floating point instructions.......................     28292480
22 Quadwords written back from primary data cache..............     32386400
23 TLB misses..................................................      5687456
24 Mispredicted branches.......................................       410064
25 Primary data cache misses...................................     16330160
26 Secondary data cache misses.................................      7708944
27 Data misprediction from scache way prediction table.........       663648
28 External intervention hits in scache........................         6144
29 External invalidation hits in scache........................         6864
30 Store/prefetch exclusive to clean block in scache...........      7582256
31 Store/prefetch exclusive to shared block in scache..........         8144

The -x option requests that perfex also gather counts for kernel code that handles exceptions, so the work done by the OS to handle TLB misses is included in these counts.

3.3.3.3 Analytic Output with the -y Option

The raw event counts are interesting, but it is more useful to convert them to elapsed time. Some time estimates are simple, for example, dividing the cycle count by the machine clock rate gives the elapsed run time (1645481936 / 195 MHz = 8.44 seconds). Other events are not as simple and can only be stated in terms of a range of times. For example, the time to handle a primary cache miss varies depending on whether the needed data are in the secondary cache, in memory, or in the cache of another CPU. Analysis of this kind can be requested using perfex -a -x -y. When you use -a , -x, and -y, perfex collects and displays all event counts, but it also displays a report of estimated times based on the counts. Here is an example, again, of the program adi2:

% perfex -a -x -y adi2

WARNING: Multiplexing events to project totals--inaccuracy possible.

 Time:      7.996 seconds
 Checksum:  5.6160428338E+06

                                                                    Based on 196 MHz IP27
                                                                                  Typical      Minimum      Maximum
   Event Counter Name                                          Counter Value   Time (sec)   Time (sec)   Time (sec)
===================================================================================================================
 0 Cycles......................................................   1639802080     8.366337     8.366337     8.366337
16 Cycles......................................................   1639802080     8.366337     8.366337     8.366337
26 Secondary data cache misses.................................      7736432     2.920580     1.909429     3.248837
23 TLB misses..................................................      5693808     1.978017     1.978017     1.978017
 7 Quadwords written back from scache..........................     61712384     1.973562     1.305834     1.973562
25 Primary data cache misses...................................     16368384     0.752445     0.235504     0.752445
22 Quadwords written back from primary data cache..............     32385280     0.636139     0.518825     0.735278
 2 Issued loads................................................    109918560     0.560809     0.560809     0.560809
18 Graduated loads.............................................     88890736     0.453524     0.453524     0.453524
 6 Decoded branches............................................     52497360     0.267844     0.267844     0.267844
 3 Issued stores...............................................     43923616     0.224100     0.224100     0.224100
19 Graduated stores............................................     33430240     0.170562     0.170562     0.170562
21 Graduated floating point instructions.......................     28371152     0.144751     0.072375     7.527040
30 Store/prefetch exclusive to clean block in scache...........      7545984     0.038500     0.038500     0.038500
24 Mispredicted branches.......................................       417440     0.003024     0.001363     0.011118
 9 Primary instruction cache misses............................         8272     0.000761     0.000238     0.000761
10 Secondary instruction cache misses..........................          768     0.000290     0.000190     0.000323
31 Store/prefetch exclusive to shared block in scache..........        15168     0.000077     0.000077     0.000077
 1 Issued instructions.........................................    673476960     0.000000     0.000000     3.436107
 4 Issued store conditionals...................................            0     0.000000     0.000000     0.000000
 5 Failed store conditionals...................................            0     0.000000     0.000000     0.000000
 8 Correctable scache data array ECC errors....................            0     0.000000     0.000000     0.000000
11 Instruction misprediction from scache way prediction table..          432     0.000000     0.000000     0.000002
12 External interventions......................................         6288     0.000000     0.000000     0.000000
13 External invalidations......................................         9360     0.000000     0.000000     0.000000
14 Virtual coherency conditions................................            0     0.000000     0.000000     0.000000
15 Graduated instructions......................................    364303776     0.000000     0.000000     1.858693
17 Graduated instructions......................................    392675440     0.000000     0.000000     2.003446
20 Graduated store conditionals................................            0     0.000000     0.000000     0.000000
27 Data misprediction from scache way prediction table.........       679120     0.000000     0.000000     0.003465
28 External intervention hits in scache........................         6288     0.000000     0.000000     0.000000
29 External invalidation hits in scache........................         5952     0.000000     0.000000     0.000000

Statistics
=========================================================================================
Graduated instructions/cycle................................................     0.222163
Graduated floating point instructions/cycle.................................     0.017302
Graduated loads & stores/cycle..............................................     0.074595
Graduated loads & stores/floating point instruction.........................     5.422486
Mispredicted branches/Decoded branches......................................     0.007952
Graduated loads/Issued loads................................................     0.808696
Graduated stores/Issued stores..............................................     0.761099
Data mispredict/Data scache hits............................................     0.078675
Instruction mispredict/Instruction scache hits..............................     0.057569
L1 Cache Line Reuse.........................................................     6.473003
L2 Cache Line Reuse.........................................................     1.115754
L1 Data Cache Hit Rate......................................................     0.866185
L2 Data Cache Hit Rate......................................................     0.527355
Time accessing memory/Total time............................................     0.750045
L1--L2 bandwidth used (MB/s, average per process)...........................   124.541093
Memory bandwidth used (MB/s, average per process)...........................   236.383187
MFLOPS (average per process)................................................     3.391108

"Maximum," "minimum," and "typical" time cost estimates are reported. Each is obtained by consulting an internal table which holds the maximum, minimum, and typical costs for each event, and multiplying this cost by the count for the event. Event costs are usually measured in terms of machine cycles, and so the cost of an event generally depends on the clock speed of the processor, which is also reported in the output. The maximum value in the table corresponds to the worst-case cost of a single occurrence of the event. Sometimes this can be a pessimistic estimate. For example, the maximum cost for graduated floating point instructions assumes that every floating point instruction is a double-precision reciprocal square root since it is the most costly R10000 floating point instruction. Because of the latency-hiding capabilities of the R10000, the minimum cost of virtually any event could be zero since most events can be overlapped with other operations. To avoid simply reporting minimum costs of zero, which would be of no practical use, the minimum time reported by perfex -y corresponds to the best-case cost of a single occurrence of the event. The best-case cost is obtained by running the maximum number of simultaneous occurrences of that event and averaging the cost. For example, two floating point instructions can complete per cycle, so the best case cost is 0.5 cycles per floating point instruction. The typical cost falls somewhere between minimum and maximum and is meant to correspond to the cost you see in average programs. perfex -y prints the event counts and associated cost estimates sorted from most costly to least costly. Although resembling a profiling output, this is not a true profile. The event costs reported are only estimates. Furthermore, since events do overlap with one another, the sum of the estimated times will usually exceed the program's run time. This output should only be used to identify which events are responsible for significant portions of the program's run time and to get a rough idea of what those costs might be. In the example above, the program spends a significant fraction of its time handling secondary cache and TLB misses. To make a significant improvement in the run time of this program, the tuning measures need to concentrate on reducing those cache misses. In addition to the event counts and cost estimates, perfex -y also reports a number of statistics derived from the typical costs. The meaning of many of the statistics is self-evident, for example, Graduated instructions/cycle. Below is a list of those statistics whose definitions require more explanation:

Data mispredict/Data scache hits 
     The ratio of the counts for data misprediction from scache way prediction table and secondary data cache misses. 
Instruction mispredict/Instruction scache hits 
     The ratio of the counts for instruction misprediction from scache way prediction table and secondary instruction cache
     misses. 
L1 Cache Line Reuse 
     The number of times, on average, that a primary data cache line is used after it has been moved into the cache. It is
     calculated as graduated loads plus graduated stores minus primary data cache misses, divided by primary data cache
     misses. 
L2 Cache Line Reuse 
     The number of times, on average, that a secondary data cache line is used after it has been moved into the cache. It is
     calculated as primary data cache misses minus secondary data cache misses, divided by secondary data cache misses. 
L1 Data Cache Hit Rate 
     The fraction of data accesses that are satisfied from a cache line already resident in the primary data cache. It is
     calculated as 1.0 - (primary data cache misses divided by the sum of graduated loads and graduated stores). 
L2 Data Cache Hit Rate 
     The fraction of data accesses that are satisfied from a cache line already resident in the secondary data cache. It is
     calculated as 1.0 - (secondary data cache misses divided by 
     primary data cache misses). 
Time accessing memory/Total time 
     The sum of the typical costs of graduated loads, graduated stores, primary data cache misses, secondary data cache
     misses, and TLB misses, divided by the total program run time. The total program run time is calculated by
     multiplying cycles by the time per cycle (inverse of the processor's clock speed). 
L1--L2 bandwidth used (MB/s, average per process) 
     The amount of data moved between the primary and secondary data caches, divided by the total program run time. The
     amount of data moved is calculated as the sum of the number of primary data cache misses multiplied by the primary
     cache line size and the number of quadwords written back from primary data cache multiplied by the size of a
     quadword (16 bytes). For multiprocessor programs, the resulting figure is a per-process average since the 
     counts measured by perfex are aggregates of the counts for all the threads. Multiply by the number of threads to get
     the total program bandwidth. 
Memory bandwidth used (MB/s, average per process) 
     The amount of data moved between the secondary data cache and main memory, divided by the total program run
     time. The amount of data moved is calculated as the sum of the number of secondary data cache misses multiplied by
     the secondary cache line size and the number of quadwords written back from secondary data cache multiplied by the
     size of a quadword (16 bytes). For multiprocessor programs, the resulting figure is a per-process average since the
     counts measured by perfex are aggregates of the counts for all the threads. Multiply by the number of threads to get
     the total program bandwidth. 
MFLOPS (MB/s, average per process) 
     The ratio of the graduated floating point instructions and the total program run time. Note that although a
     multiply-add carries out two floating point operations, it only counts as one 
     instruction, so this statistic may underestimate the number of floating point operations per second. For multiprocessor
     programs, the resulting figure is a per-process average since the counts measured by perfex are aggregates of the
     counts for all the threads. Multiply by the number of threads to get the total program rate. 

These statistics give you a quick way to identify performance problems in your program. For example, the cache hit-rate statistics tell you how cache friendly your program is. Since a secondary cache miss is much more expensive than a cache hit, the L2 Data Cache Hit Rate needs to be close to 1.0 to indicate that the program is not paying a large penalty for the cache misses. Values of ~0.96 and above indicate good cache performance. Note that, for the above example, the rate is 0.53, further confirmation of the cache problems in this program.

3.3.4 Using SpeedShop

The purpose of profiling is to find out exactly where a program is spending its time, that is, in precisely which procedures or lines of code. Then you can concentrate your efforts on the (usually small) areas of code where there is the most to be gained. Profiling using the SpeedShop package supports these methods:

     Sampling, in which the subject program is frequently interrupted; the program counter (PC) and stack are recorded on
     each interruption. The more frequently the PC is found in a particular procedure, the more execution time that
     procedure costs. SpeedShop can sample on a variety of time bases: the system timer or any of the R10000 performance
     counters.
     Ideal counting, in which a copy of the subject program binary is modified with trap instructions at the end of each
     basic block. During execution, the exact number of uses of each basic block is counted. 
     Exception trace, not really a profiling method, records only floating point exceptions and their locations. 
The SpeedShop package has three parts:

     ssrun performs experiments and collects data . 
     prof processes data and prepares reports. 
     The ssapi interface allows you to insert caliper points into a program to profile specific sections of code or phases of
     execution. 

These programs are documented in the following reference pages: speedshop documents the types of experiments, as well as a number of environment variables you can set. ssrun documents specific options of ssrun. You need both speedshop(1) and ssrun(1) to run an experiment. prof documents the report types and the options you use to get them. ssapi documents the three library calls you can use.

3.3.4.1 PC Sampling Profiling

The accuracy of sampling depends on the time base that sets the sampling interval. In each case, the time base is the independent variable and the program state is the dependent variable. Select from the sampling methods summarized in this table:

   ssrun
   Option
             Time Base
                                                     Comments
 -usertime
           30 ms timer 
                        Fairly coarse resolution; experiment runs quickly and output file is small; some bugs noted
                        in speedshop(1).
 -pcsamp[x]
 -fpcsamp[x]
           10 ms timer
           1 ms timer
                        Moderately coarse resolution; functions that cause cache misses or page faults are
                        emphasized. Suffix x for 32-bit counts.
 -gi_hwc
 -fgi_hwc
           32771 insts
           6553 insts
                        Fine-grain resolution based on graduated instructions. Emphasizes functions that burn a lot
                        of instructions.
 -cy_hwc
 -fcy_hwc
           16411 clocks
           3779 clocks
                        Fine-grain resolution based on elapsed cycles. Emphasizes functions with cache misses and
                        mispredicted branches.
 -ic_hwc
 -fic_hwc
           2053 icache miss
           419 icache miss
                        Emphasizes code that doesn't fit in L1 cache.
 -isc_hwc
 -fisc_hwc
           131 scache miss
           29 scache miss
                        Emphasizes code that doesn't fit in L2 cache. Should be coarse-grained measure.
 -dc_hwc
 -fdc_hwc
           2053 dcache miss
           419 dcache miss
                        Emphasizes code that causes L1 cache data misses.
 -dsc_hwc
 -fdsc_hwc
           131 scache miss
           29 scache miss
                        Emphasizes code that causes L2 cache data misses.
 -tlb_hwc
 -ftlb_hwc
           257 TLB miss
           53 TLB miss
                        Emphasizes code that causes page faults.
 -gfp_hwc
 -fgfp_hwc
           32771 fp insts
           6553 fp insts
                        Emphasizes code that performs heavy FP calculation.
 -prof_hwc
           user-set
                        Hardware counter and overflow value from environment variables.


Each time base finds the program PC more often in the code that consumes the most units of that time base:

     The time bases that reflect actual elapsed time (-usertime, -pcsamp, -cy_hwc) find the PC more often in the code
     where the program spends elapsed time. The time may be spent in that code because it is executed a lot, or it might be
     spent there because those instructions are processed slowly owing to cache misses, memory contention, or failed
     branch prediction. Use these to get an overview of the program and to find major trouble spots. 
     The time bases that reflect instruction counts (-gi_hwc, -gfp_hwc) find the PC more often in the code that actually
     performs the most instructions. Use these to find the code that could benefit most from algorithmic changes. 
     The time bases that reflect data access (-dc_hwc, -sc_hwc, -tlb_hwc) find the PC more often in the code that has to
     wait for its data to be brought in from another memory level. Use these to find memory access problems. 
     The time bases that reflect code access (-ic_hwc, -isc_hwc) find the PC more often in the code that has to be fetched
     from memory when it is called. Use these to pinpoint functions that could be reorganized for better locality, or to see
     when automatic inlining has gone too far. 

It is easy to perform an experiment. Here is the application of an experiment to program adi2:

% ssrun -fpcsamp adi2
 Time:      7.619 seconds
 Checksum:  5.6160428338E+06

The output file of samples is left in a file with the default name of ./command.experiment.pid. In this case the name was adi2.fpcsamp.4885. It is often more convenient, however, to dictate the name of the output file. You can do this by putting the desired filename and directory in environment variables. Using this csh script you can run an experiment, passing the output directory and filename on the command line, for example

% ssruno -d /var/tmp -o adi2.cy -cy_hwc adi2
ssrun -cy_hwc adi2 ..................................
 Time:      9.644 seconds
 Checksum:  5.6160428338E+06
.................................. ssrun ends.
-rw-r--r--    1 guest    guest      18480 Dec 17 16:25 /var/tmp/adi2.cy

3.3.4.2 Using Prof

Regardless of which time base you use for sampling, you display the result using prof. By default, prof displays a list of procedures ordered from the one with the most samples to the least:

% prof adi2.fpcsamp.4885
-------------------------------------------------------------------------------
Profile listing generated Sat Jan  4 10:28:11 1997
    with:       prof adi2.fpcsamp.4885 
-------------------------------------------------------------------------------

samples   time    CPU    FPU   Clock   N-cpu  S-interval Countsize
   8574   8.6s R10000 R10010 196.0MHz   1      1.0ms     2(bytes)

Each sample covers 4 bytes for every  1.0ms ( 0.01% of 8.5740s)


-------------------------------------------------------------------------------
  -p[rocedures] using pc-sampling.
  Sorted in descending order by the number of samples in each procedure.
  Unexecuted procedures are excluded.
-------------------------------------------------------------------------------

samples   time(%)      cum time(%)      procedure (dso:file)

   6688   6.7s( 78.0)  6.7s( 78.0)         zsweep (adi2:adi2.f)
    671  0.67s(  7.8)  7.4s( 85.8)         xsweep (adi2:adi2.f)
    662  0.66s(  7.7)    8s( 93.6)         ysweep (adi2:adi2.f)
    208  0.21s(  2.4)  8.2s( 96.0)       fake_adi (adi2:adi2.f)
    178  0.18s(  2.1)  8.4s( 98.1)         irand_ (/usr/lib32/libftn.so:../../libF77/rand_.c)
    166  0.17s(  1.9)  8.6s(100.0)          rand_ (/usr/lib32/libftn.so:../../libF77/rand_.c)
      1 0.001s(  0.0)  8.6s(100.0)      __oserror (/usr/lib32/libc.so.1:oserror.c)

   8574   8.6s(100.0)  8.6s(100.0)          TOTAL

This profile indicates that you should focus on the routine zsweep, since it consumes almost 80% of the run time of this program. For finer detail, use the -heavy option. This supplements the list of procedures with a list of individual source line numbers, ordered by frequency:

-------------------------------------------------------------------------------
  -h[eavy] using pc-sampling.
     Sorted in descending order by the number of samples in each line.
     Lines with no samples are excluded.
-------------------------------------------------------------------------------

samples   time(%)      cum time(%)      procedure (file:line)

   3405   3.4s( 39.7)  3.4s( 39.7)         zsweep (adi2.f:122)
   3226   3.2s( 37.6)  6.6s( 77.3)         zsweep (adi2.f:126)
    425  0.42s(  5.0)  7.1s( 82.3)         xsweep (adi2.f:80)
    387  0.39s(  4.5)  7.4s( 86.8)         ysweep (adi2.f:101)
    273  0.27s(  3.2)  7.7s( 90.0)         ysweep (adi2.f:105)
    246  0.25s(  2.9)    8s( 92.9)         xsweep (adi2.f:84)
    167  0.17s(  1.9)  8.1s( 94.8)         irand_ (../../libF77/rand_.c:62)
    163  0.16s(  1.9)  8.3s( 96.7)       fake_adi (adi2.f:18)
    160  0.16s(  1.9)  8.5s( 98.6)          rand_ (../../libF77/rand_.c:69)
     45 0.045s(  0.5)  8.5s( 99.1)       fake_adi (adi2.f:59)
     32 0.032s(  0.4)  8.5s( 99.5)         zsweep (adi2.f:113)
     21 0.021s(  0.2)  8.5s( 99.7)         zsweep (adi2.f:121)
     11 0.011s(  0.1)  8.6s( 99.8)         irand_ (../../libF77/rand_.c:63)
      6 0.006s(  0.1)  8.6s( 99.9)          rand_ (../../libF77/rand_.c:67)
      4 0.004s(  0.0)  8.6s(100.0)         zsweep (adi2.f:125)
      1 0.001s(  0.0)  8.6s(100.0)         ysweep (adi2.f:104)
      1 0.001s(  0.0)  8.6s(100.0)         ysweep (adi2.f:100)
      1 0.001s(  0.0)  8.6s(100.0)      __oserror (oserror.c:127)

   8574   8.6s(100.0)  8.6s(100.0)          TOTAL

From this listing it is clear that lines 122 and 126 warrant further inspection. Even finer detail can be obtained with the -source option, which lists the source code and disassembled machine code, indicating sample hits on specific instructions.

3.3.4.3 Ideal Time Profiling

The other type of profiling is called ideal time, or basic block, profiling. Basic block is a compiler term for a section of code that has only one entrance and one exit. Any program can be decomposed into basic blocks. To obtain an ideal profile, ssrun copies the executable program and modifies the copy to contain code that records the entry to each basic block. Not only the executable itself but also all dynamic shared objects (DSOs; for more information, see dso(5)) that it links to are copied and instrumented. The instrumented executable and libraries are statically linked and run:

% ssrun -ideal adi2
Beginning libraries
        /usr/lib32/libssrt.so
        /usr/lib32/libss.so
        /usr/lib32/libfastm.so
        /usr/lib32/libftn.so
        /usr/lib32/libm.so
        /usr/lib32/libc.so.1
Ending libraries, beginning "adi2"
 Time:      8.291 seconds
 Checksum:  5.6160428338E+06

The number of times each basic block was encountered is recorded. The output data file is displayed using prof, just as for a sampled run. The report ranks source and library procedures from most to least used:

% prof adi2.ideal.4920

  Prof run at: Sat Jan  4 10:34:06 1997
 Command line: prof adi2.ideal.4920 


          285898739: Total number of cycles
           1.45867s: Total execution time
          285898739: Total number of instructions executed
              1.000: Ratio of cycles / instruction
                196: Clock rate in MHz
             R10000: Target processor modeled


---------------------------------------------------------
Procedures sorted in descending order of cycles executed.
Unexecuted procedures are not listed. Procedures
beginning with *DF* are dummy functions and represent
init, fini and stub sections.
---------------------------------------------------------

        cycles(%)  cum %     secs    instrns    calls procedure(dso:file)

  68026368(23.79)  23.79     0.35   68026368 32768 xsweep(adi2.pixie:adi2.f)
  68026368(23.79)  47.59     0.35   68026368 32768 ysweep(adi2.pixie:adi2.f)
  68026368(23.79)  71.38     0.35   68026368 32768 zsweep(adi2.pixie:adi2.f)
  35651584(12.47)  83.85     0.18   35651584 2097152 rand_(./libftn.so.pixn32:../../libF77/rand_.c)
  27262976( 9.54)  93.39     0.14   27262976 2097152 irand_(./libftn.so.pixn32:../../libF77/rand_.c)
  18874113( 6.60)  99.99     0.10   18874113     1 fake_adi(adi2.pixie:adi2.f)
     11508( 0.00)  99.99     0.00      11508     5 memset(./libc.so.1.pixn32:/slayer_xlv0/ficussg-nov05/work/irix/lib/libc/libc_n32_M4/strings/bzero.s)
      3101( 0.00)  99.99     0.00       3101    55 __flsbuf(./libc.so.1.pixn32:_flsbuf.c)
      2446( 0.00) 100.00     0.00       2446    42 x_putc(./libftn.so.pixn32:../../libI77/wsfe.c)
      1234( 0.00) 100.00     0.00       1234     2 x_wEND(./libftn.so.pixn32:../../libI77/wsfe.c)
      1047( 0.00) 100.00     0.00       1047     1 f_exit(./libftn.so.pixn32:../../libI77/close.c)
      1005( 0.00) 100.00     0.00       1005     5 fflush(./libc.so.1.pixn32:flush.c)
       639( 0.00) 100.00     0.00        639     4 do_fio64_mp(./libftn.so.pixn32:../../libI77/fmt.c)
       566( 0.00) 100.00     0.00        566     3 wrt_AP(./libftn.so.pixn32:../../libI77/wrtfmt.c)
       495( 0.00) 100.00     0.00        495     6 map_luno(./libftn.so.pixn32:../../libI77/util.c)
       458( 0.00) 100.00     0.00        458    14 op_gen(./libftn.so.pixn32:../../libI77/fmt.c)
       440( 0.00) 100.00     0.00        440     9 gt_num(./libftn.so.pixn32:../../libI77/fmt.c)
       414( 0.00) 100.00     0.00        414     1 getenv(./libc.so.1.pixn32:getenv.c)
                                    .
                                    .
                                    .

The -heavy option adds a list of source lines, sorted by their consumption of ideal instruction cycles. An ideal profile shows exactly and repeatedly which statements are most executed and gives you an exact view of the algorithmic complexity of the program. An ideal profile does not necessarily reflect where a program spends its time since it cannot take cache and TLB misses into account. Consequently, the results of the ideal profile are startlingly different from that of the PC sampling profile. These ideal results indicate that zsweep should take exactly the same amount of run time as ysweep and xsweep. These differences can be used to infer where cache performance issues exist. On machines without the R10000's hardware profiling registers, such comparisons are the only profiling method available to locate cache problems.

3.3.4.3.1 Operation Counts

Since ideal profiling counts the instructions executed by the program, it can provide all sorts of interesting information about the program. Already printed in the standard prof output are counts of how many times each subroutine is called. In addition, you may use the -op option to prof to get a listing detailing the counts of all instructions in the program. In particular, this will provide an exact count of the floating point operations executed:

% prof -op adi2.ideal.4920

  Prof run at: Wed Jan 15 14:42:54 1997
 Command line: prof -op adi2.ideal.4920


          285898739: Total number of cycles
           1.45867s: Total execution time
          285898739: Total number of instructions executed
              1.000: Ratio of cycles / instruction
                196: Clock rate in MHz
             R10000: Target processor modeled


---------------------------------------------------------
pixstats summary
---------------------------------------------------------
           56590456: Floating point operations (38.796 Mflops @ 196 MHz)
          105500230: Integer operations (72.3265 M intops @ 196 MHz)

                             .
                             .
                             .

Note that this is different from what you get using perfex. The R10000 counter #21 counts floating point instructions, not floating point operations. As a result, in a program that executes a lot of multiply-add instructions --- each of which carries out two floating point operations --- perfex's MFLOPS statistic can be off by a factor of two. Since prof -op records all instructions executed, it counts each multiply-add instruction as two floating point operations, thus providing the correct tally. The Mlops figure it calculates, however, is based on the ideal time; to calculate floating point performance, divide the number of floating point operations counted by prof -op by wall clock time. Either method of profiling, PC sampling or ideal, can be applied to multiprocessor runs just as easily as it is applied to single-processor runs; each thread of an application maintains its own histogram, and the histograms may be printed individually or merged in any combination and printed as one profile.

3.3.4.4 Gprof

One limitation of the prof output for either PC sampling or ideal time is that the information reported contains no information about the call hierarchy. That is, if the routine zsweep in the above example were called from two different locations in the program, you would not know how much time results from the call at each location; you would only know the total time spent in zsweep. If you knew that, say, the first location was responsible for the majority of the time, this could affect how you tune the program. For example, you might try inlining the call into the first location, but not bother with the second. Or, if you wanted to parallelize the program, knowing that the first location is where the majority of the time is spent, you might consider parallelizing the calls to zsweep there rather than trying to parallelize the zsweep routine itself. SpeedShop provides two methods of obtaining hierarchical profiling information. The first method, which is called gprof, is used in conjunction with the ideal time profile. To obtain the gprof information for the above example, simply add the flag -gprof to the prof command:

% prof -gprof adi2.ideal.4920

  Prof run at: Wed Jan 15 16:52:09 1997
 Command line: prof -gprof adi2.ideal.4920


          285898739: Total number of cycles
           1.45867s: Total execution time
          285898739: Total number of instructions executed
              1.000: Ratio of cycles / instruction
                196: Clock rate in MHz
             R10000: Target processor modeled


---------------------------------------------------------
Procedures sorted in descending order of cycles executed.
Unexecuted procedures are not listed. Procedures
beginning with *DF* are dummy functions and represent
init, fini and stub sections.
---------------------------------------------------------

        cycles(%)  cum %     secs    instrns    calls procedure(dso:file)

  68026368(23.79)  23.79     0.35   68026368 32768 xsweep(adi2.pixie:adi2.f)
  68026368(23.79)  47.59     0.35   68026368 32768 ysweep(adi2.pixie:adi2.f)
  68026368(23.79)  71.38     0.35   68026368 32768 zsweep(adi2.pixie:adi2.f)
  35651584(12.47)  83.85     0.18   35651584 2097152 rand_(./libftn.so.pixn32:../../libF77/rand_.c)
  27262976( 9.54)  93.39     0.14   27262976 2097152 irand_(./libftn.so.pixn32:../../libF77/rand_.c)
                                                  .
                                                  .
                                                  .

          All times are in milliseconds.
--------------------------------------------------------------------------------
          NOTE: any functions which are not part of the call
                graph are listed at the end of the gprof listing
--------------------------------------------------------------------------------

                                    self             kids        called/total    parents
 index          cycles(%)          self(%)          kids(%)  called+self     name     index
                                    self             kids        called/total    children

   [1]  285895481(100.00%)         57( 0.00%)  285895424(100.00%)     0       __start [1]
                                      50          285895369         1/1         main [2]
                                       3                  0         1/1         __istart [112]
                                       2                  0         1/1         __readenv_sigfpe [113]

--------------------------------------------------------------------------------

                                      50          285895369         1/1         __start [1]
   [2]  285895419(100.00%)         50( 0.00%)  285895369(100.00%)     1       main [2]
                                18874113          267020445         1/1         fake_adi [3]
                                     205                606         5/5         signal [44]

--------------------------------------------------------------------------------

                                18874113          267020445         1/1         main [2]
   [3]  285894558(100.00%)   18874113( 6.60%)  267020445(93.40%)     1       fake_adi [3]
                                68026368                  0     32768/32768     zsweep [4]
                                68026368                  0     32768/32768     ysweep [5]
                                68026368                  0     32768/32768     xsweep [6]
                                35651584           27262976     2097152/2097152     rand_ [7]
                                      28              13486         2/2         s_wsfe64 [9]
                                      26               5368         2/2         e_wsfe [17]
                                      22               2610         1/1         do_fioxr4v [25]
                                      22               2610         1/1         do_fioxr8v [24]
                                      23               2428         1/1         s_stop [28]
                                     114                 44         2/2         dtime_ [68]

--------------------------------------------------------------------------------

                                68026368                  0     32768/32768     fake_adi [3]
   [4]   68026368(23.79%)   68026368(100.00%)          0( 0.00%) 32768       zsweep [4]

--------------------------------------------------------------------------------

                                68026368                  0     32768/32768     fake_adi [3]
   [5]   68026368(23.79%)   68026368(100.00%)          0( 0.00%) 32768       ysweep [5]

--------------------------------------------------------------------------------

                                68026368                  0     32768/32768     fake_adi [3]
   [6]   68026368(23.79%)   68026368(100.00%)          0( 0.00%) 32768       xsweep [6]

--------------------------------------------------------------------------------
                                                  .
                                                  .
                                                  .

This produces the usual ideal time profiling output, but following that is the hierarchical information. There is a block of information for each subroutine in the program. A number, shown in brackets (e.g., [1]), is assigned to each routine so that the information pertaining to it can easily be located in the output. Let's look in detail at the block of information provided; we'll use fake_adi [3] as an example. The line beginning with the number [3] shows, from left to right, the:

     Number of cycles consumed by this routine and the routines it calls (its descendants) 
     Number of cycles spent inside the routine, but not in any of its descendants 
     Number of cycles spent in its descendants 
     Total number of times the routine was called 
     Name of the routine, fake_adi [3] 

Above this line are lines showing which routines fake_adi [3] was called from. In this case, it is only called from one place, main [2], so there is just one line (in general, there would be one line for each routine which calls fake_adi [3]). This line shows

     The proportion of the cycles spent inside fake_adi [3] as a result of the call from main [2] 
     The proportion of time spent in fake_adi [3]'s descendants as a result of the call from main [2] 
     How many calls there are to fake_adi [3] in main [2] / the total number of calls to fake_adi [3] from all
     places in the program 

Since fake_adi [3] is only called once, all the time in it and its descendants is da result of this one call. Below the line beginning with the number [3] are all the descendants of fake_adi [3]. For each descendant you see: The proportion of the descendant's cycles spent inside it The proportion of the descendant's cycles spent in its descendants (i.e., fake_adi [3]'s grandchildren) How many calls to the descendant there are in fake_adi [3] / the total number of calls to the descendant from all places in the program This block of information allows you to determine not just which subroutines but which paths in the program are responsible for the majority of time. The only limitation is that gprof reports ideal time, so cache misses are not represented.

3.3.4.5 Usertime Profiling

gprof only reports ideal time. To get hierarchical profiling information that accurately accounts for all the time in the program, the way PC sampling does, use usertime profiling. For this type of profiling, the program is sampled during the run. At each sample, the location of the program counter is noted and the entire call stack is traced to record which routines have been called to get to this point in the program. From this, a hierarchical profile is constructed. Since unwinding the call stack is an expensive operation, the sampling period is for usertime profiling is relatively long: 30 milliseconds. usertime profiling is performed with the following command:

% ssrun -usertime adi2

The output is written to a file called ./adi2.usertime.pid, where pid is the process ID for this run of adi2. The profile is displayed using prof just as for PC sampling and ideal time profiling: % prof adi2.usertime.19572 The output is as follows:

-------------------------------------------------------------------------------
Profile listing generated Wed Jan 15 16:57:10 1997
    with:       prof adi2.usertime.19572
-------------------------------------------------------------------------------

        Total Time (secs)     : 9.99
        Total Samples         : 333
        Stack backtrace failed: 0
        Sample interval (ms)  : 30
        CPU                   : R10000
        FPU                   : R10010
        Clock                 : 196.0MHz
        Number of CPUs        : 1


-------------------------------------------------------------------------------

index  %Samples     self descendents  total        name
[1]    100.0%    0.00        9.99     333          __start
[2]    100.0%    0.00        9.99     333          main
[3]    100.0%    0.09        9.90     333          fake_adi
[4]     80.8%    8.07        0.00     269          zsweep
[5]      7.5%    0.75        0.00      25          xsweep
[6]      6.9%    0.69        0.00      23          ysweep
[7]      3.9%    0.12        0.27      13          rand_
[8]      2.7%    0.27        0.00       9          irand_

The information is less detailed than that provided by gprof, but when combined with gprof, you can get a complete hierarchical profile for all routines which have run long enough to be sampled.

Multiprocessor Programming

4.1 Introduction

The selling point of a multiprocessor is that you can apply the power of multiple CPUs to your program, so that it completes in less time. When we start running a program concurrently on multiple CPUs, we discover:

     There is a mathematical limit on how fast a real program can go.
     A variety of programming models can be used to express parallelism. 
     Parallel processing introduces new bottlenecks, but there are additional 
     tools to deal with them.
You can write a program that expresses parallel execution in the source code, or you can let the Power Fortran or Power C compiler find the parallelism that is implicit in a conventional, serial program.

4.2 Parallel Speedup and Amdahl's Law

When your program runs on more than one CPU, its total run time should be less. But how much less? And what are the limits on the speedup?

4.2.1 Adding CPUs to Shorten Execution Time

You can make your program run faster by distributing the work it does over multiple CPUs. There is always some part of the program's logic that has to be executed serially, by a single CPU. However, suppose there is one loop where the program spends 50% of the execution time. If you can divide the iterations of this loop so that half of them are done in one CPU while the other half are done at the same time in a different CPU, the whole loop can be finished in half the time. The result: a 25% reduction in program execution time. The following topics cover the formal mathematical treatment of these ideas (Amdahl's law). There are two basic limits to the speedup you can achieve by parallel execution:

     The fraction of the program that can be run in parallel, p, is never 100%. 
     Because of hardware constraints, after a certain point, there is less and 
     less benefit from each added CPU. 
Tuning for parallel execution comes down to battling these two limits. You strive to increase the parallel fraction, p, because in some cases even a small change in p (from 0.8 to 0.85, for example) makes a dramatic change in the effectiveness of added CPUs. And you work to ensure that each added CPU does a full CPU's work and does not interfere with the work of other CPUs. In the Origin architectures this means
     Evenly spreading the workload among the CPUs 
     Eliminating false sharing and other types of memory contention between CPUs 
     Making sure that the data used by each CPU are located in a memory near that CPU's node 
There are two ways to obtain the use of multiple CPUs. You can write your source code to use explicit parallelism --- showing in the source code which parts of the program are to execute asynchronously and how the parts are to coordinate with each other. Alternatively you can take a conventional program in C, C++, or Fortran, and apply a precompiler to it to find the parallelism that is implicit in the code. These precompilers, IRIS Power C and Power Fortran, insert the source directives for parallel execution for you.

4.2.2 Parallel Speedup

If half the iterations of a DO-loop are performed on one CPU, and the other half run at the same time on a second CPU, the whole DO-loop should complete in half the time. For example,

   for (j=0; j==MAX; ++j) { z[j] = a[j]*b[j]; }
The IRIS Power C package can automatically distribute such a loop over n CPUs (with n decided at run time based on the available hardware), so that each CPU performs MAX/n iterations. The speedup gained from applying n CPUs, Speedup(n), is the ratio of the one-CPU execution time to the n-CPU parallel execution time:

Speedup(n) = T(1)/T(n).

If you measure the one-CPU execution time of a program at 100 seconds, and the program runs in 60 seconds with 2 CPUs, Speedup(2) = 100/60 = 1.67. This number captures the improvement from adding hardware to the system. We expect T(n) to be less than T(1); if it is not, adding CPUs has made the program slower, and something is wrong! So Speedup(n) should be a number greater than 1.0, and the greater it is, the more pleased we are. Intuitively you might hope that the speedup would be equal to the number of CPUs --- twice as many CPUs, half the time --- but this ideal can never be achieved (well, almost never). Normally, the number Speedup(n) must be less than n, reflecting the fact that not all parts of a program benefit from parallel execution. However, it is possible --- in rare situations --- for Speedup(n) to be larger than n. This is called a isuperlinear speedup --- the program has been sped up by more than the increase of CPUs. A superlinear speedup does not really result from parallel execution. It comes about because each CPU is now working on a smaller set of memory. The problem data handled by any one CPU fits better in cache, so each CPU executes faster than the single CPU could do. A superlinear speedup is welcome, but it indicates that the sequential program was being held back by cache effects.

4.2.3 Amdahl's Law

There are always parts of a program that you cannot make parallel --- code that must run serially. For example, consider the DO-loop. Some amount of code is devoted to setting up the loop, allocating the work between CPUs. Then comes the parallel part, with all CPUs running concurrently. At the end of the loop is more housekeeping that must be done serially; for example, if n does not divide MAX evenly, one CPU must execute the few iterations that are left over. The serial parts of the program cannot be speeded up by concurrency.

The mathematical statement of this idea is called Amdahl's law. Let p be the fraction of the program's code that can be made parallel (p is always a fraction less than 1.0.) The remaining (1-p) of the code must run serially. In practical cases, p ranges from 0.2 to 0.99.

The potential speedup for a program is proportional to p divided by the CPUs you can apply, plus the remaining serial part, (1-p):

                  1
Speedup(n) = -----------    (Amdahl's law: Speedup(n) given p)
             (p/n)+(1-p)

Suppose p = 0.8; then Speedup(2 )= 1/(0.4+0.2) = 1.67, and Speedup(4) = 1/(0.2+0.2) = 2.5.

The maximum possible speedup --- if you could apply an infinite number of CPUs --- would be 1/(1-p). In particular, the more CPUs you have, the more benefit you get from increasing p. Using only 4 CPUs, you need only p = 0.6 to get half the ideal speedup. With 8 CPUs, you need p = 0.85 to get half the ideal speedup.

4.2.4 Calculating the Parallel Fraction of a Program

You do not have to guess at the value of p for a given program. Measure the execution times T(1) and T(2) to calculate a measured Speedup(2) = T(1)/T(2). The Amdahl's law equation can be rearranged to yield p when Speedup(n) is known:

     2    SpeedUp(2) - 1
p = --- * --------------    (Amdahl's law: p given Speedup(2))
     1      SpeedUp(2) 

Suppose you measure T(1) = 188 seconds and T(2) = 104 seconds. SpeedUp(2) = 188/104 = 1.81 p = 2 * ((1.81-1)/1.81) = 2*(0.81/1.81) = 0.895

4.2.5 Predicting Execution Time with n CPUs

In some cases, the Speedup(2) = T(1)/T(2) is a value greater than 2, in other words, a superlinear speedup (described earlier). When this occurs, the formula in the preceding section returns a value of p greater than 1.0, clearly not useful. In this case you need to calculate p from two other more realistic timings, for example T(2) and T(3). The general formula for p is:

                Speedup(n) - Speedup(m)
p  =  -------------------------------------------
      (1 - 1/n)*Speedup(n) - (1 - 1/m)*Speedup(m)

where n and m are the two processor counts whose speedups are known. You can use this calculated value of p to extrapolate the potential speedup with higher numbers of processors. For example, what is the expected time with 4 CPUs?

Speedup(4)= 1/((0.895/4)+(1-0.895)) = 3.04 T(4)= T(1)/Speedup(4) = 188/3.04 = 61.8

4.5.5 Cache Friendly Programs

Data placement issues won't arise for all parallel programs: those which are cache friendly do not incur performance penalties if the data placement is not optimal. This is because such programs satisfy their memory requests primarily from cache, rather than main memory, and the time is the same for each processor to access its caches. Thus, if some memory requests take a bit longer than ideal, the net effect on the program will be negligible. So data placement can only be an issue for parallel programs which are very memory intensive and are not cache friendly. You can determine how cache friendly a program is using perfex(1). The command

% perfex -a -x -y a.out

will tell you how many primary and secondary cache misses the program a.out generates, what the cache hit ratios are, and it will estimate how much the cache misses cost. If they only account for a small percentage of the run time, then the program makes good use of the cache and its performance will not be affected by data placement. On the other hand, if the time spent in cache misses is high, then the program is not cache friendly and data placement could affect performance. But there are other issues common to all cache-based shared memory platforms which are more likely to be the source of performance problems, and we consider these in the next section.

4.5.6 Non Cache Friendly Programs

If a properly parallelized MP library program does not scale as well as expected, there are several potential causes. The first thing you need to check for is whether false sharing or other forms of cache coherency contention are a problem. Cache coherency contention can arise in any cache-based multiprocessor system. It can only be an issue, however, for data that are frequently updated or written. Data that are mostly read and rarely written do not cause cache coherency contention. When one CPU modifies a cache line, any other CPU that has a copy of the same cache line is notified, discards its copy, and fetches a new copy when it needs those data again. This can cause performance problems in two cases:

1. If one CPU repeatedly updates a cache line that other CPUs use for input, all the reading CPUs are forced to frequently retrieve a new copy of that cache line from memory. This slows all the reading CPUs.

2. If two or more CPUs repeatedly update the same cache line, they contend for the exclusive ownership of the cache line. Each CPU has to get ownership and fetch a new copy of the cache line before it can perform its update. This forces the updating CPUs to execute serially, as well as making all CPUs fetch a new copy of the cache line on every use.

The first of these is generic cache coherency contention. The latter is referred to as false sharing. Fortunately, the tools available for Origin can help you identify these problems. In both cases, perfex(1) will reveal a high number of cache invalidation events (cf. hardware counters 31, 12, 13, 28, and 29 --- refer to the description of perfex for details of using the hardware counters). In addition, the CPU(s) repeatedly updating the same cache line shows a high count of stores to shared cache lines (counter 31). Some examples of how these problems occur and what you can do to fix them are now presented.

4.5.6.1 False Sharing

False sharing is best demonstrated with an example. Consider the following code:

      integer m, n, i, j
      real    a(m,n), s(m)

c$doacross local(i,j), shared(s,a) do i = 1, m s(i) = 0.0 do j = 1, n s(i) = s(i) + a(i,j) enddo enddo

This code calculates the sums of the rows of a matrix. For simplicity, assume m = 4 and that the code is run on up to four processors. What you observe is that the time for the parallel runs is longer than when just one processor is used. To understand what causes this, let's look at what happens when this loop is run in parallel. The following is a timeline of the operations which are carried out (more or less) simultaneously:

   t = 0             t = 1                   t = 2            ...

s(1) = 0.0    s(1) = s(1) + a(1,1)    s(1) = s(1) + a(1,2)    ...
s(2) = 0.0    s(2) = s(2) + a(2,1)    s(2) = s(2) + a(2,2)    ...
s(3) = 0.0    s(3) = s(3) + a(3,1)    s(3) = s(3) + a(3,2)    ...
s(4) = 0.0    s(4) = s(4) + a(4,1)    s(4) = s(4) + a(4,2)    ...

At each stage of the calculation, all four processors attempt to concurrently update one element of the sum array, s(m). For a processor to update one element of s, it needs to gain exclusive access to the cache line holding the element it wishes to update. But since s is only four words in size, it is likely that s is contained entirely in a single cache line, so each processor needs exclusive access to all of s. Thus, only one processor at a time can update an element of s. Instead of operating in parallel, the calculation is serialized. Actually, it's a bit worse than just serialized. For a processor to gain exclusive access to a cache line, it first needs to invalidate any cached copies of s which may reside in the other processors. Then it needs to read a fresh copy of the cache line from memory since the invalidations will have caused data in some other processor's cache to be written back to main memory. So, whereas in a sequential version, the element of s being updated can be kept in a register, in the parallel version, false sharing forces the value to continually be reread from memory, in addition to serializing the updates. This serialization is purely a result of the unfortunate accident that the different elements of s ended up in the same cache line. If each element were in a separate cache line, each processor could keep a copy of the appropriate line in its cache, and the calculations could be done perfectly in parallel. This, then, shows how to fix the problem. All you need do is spread out the elements of s so that each is in its own cache line. Here is a simple way to do this:
      integer m, n, i, j
      real    a(m,n), s(32,m)

c$doacross local(i,j), shared(s,a)
      do i = 1, m
         s(1,i) = 0.0
         do j = 1, n
            s(1,i) = s(1,i) + a(i,j)
         enddo
      enddo

Convert s to a two-dimensional array with the first dimension one (secondary) cache line in size.

Tuning Highlights | PECM | SCD | Tips