-lperfex
Since you can use SpeedShop to see where in a program various event counts come from, libperfex is not described in detail.
More information can be found in its reference page.
3.3.3.1 Absolute Counts of One or Two Events
Use perfex options to specify what is to be counted. You can specify one or two countable events. In this case, the counts are
absolute and accurate. For example, the command
% perfex -e 15 -e 21 adi2
runs the subject program and reports the exact counts of graduated instructions and graduated floating point instructions.
You use this mode to explore specific points of program behavior.
3.3.3.2 Statistical Counts of All Events
When you specify option -a (all events), perfex multiplexes all 32 events over the program run. Each count is active 1/16 of
the time and then scaled by 16. The resulting counts have some statistical error. The error is small (and the counts
sufficiently repeatable) provided that the subject program runs in a stable execution mode for a number of seconds. When
the program runs for a short time, or shifts between radically different regimes of instruction or data use, the counts are less
dependable and less repeatable. Nevertheless, perfex -a usually gives a good overview of program operation.
Here is the perfex command line and output applied to a sample program called adi2:
% perfex -a -x adi2
WARNING: Multiplexing events to project totals--inaccuracy possible.
Time: 7.990 seconds
Checksum: 5.6160428338E+06
0 Cycles...................................................... 1645481936
1 Issued instructions......................................... 677976352
2 Issued loads................................................ 111412576
3 Issued stores............................................... 45085648
4 Issued store conditionals................................... 0
5 Failed store conditionals................................... 0
6 Decoded branches............................................ 52196528
7 Quadwords written back from scache.......................... 61794304
8 Correctable scache data array ECC errors.................... 0
9 Primary instruction cache misses............................ 8560
10 Secondary instruction cache misses.......................... 304
11 Instruction misprediction from scache way prediction table.. 272
12 External interventions...................................... 6144
13 External invalidations...................................... 10032
14 Virtual coherency conditions................................ 0
15 Graduated instructions...................................... 371427616
16 Cycles...................................................... 1645481936
17 Graduated instructions...................................... 400535904
18 Graduated loads............................................. 90474112
19 Graduated stores............................................ 34776112
20 Graduated store conditionals................................ 0
21 Graduated floating point instructions....................... 28292480
22 Quadwords written back from primary data cache.............. 32386400
23 TLB misses.................................................. 5687456
24 Mispredicted branches....................................... 410064
25 Primary data cache misses................................... 16330160
26 Secondary data cache misses................................. 7708944
27 Data misprediction from scache way prediction table......... 663648
28 External intervention hits in scache........................ 6144
29 External invalidation hits in scache........................ 6864
30 Store/prefetch exclusive to clean block in scache........... 7582256
31 Store/prefetch exclusive to shared block in scache.......... 8144
The -x option requests that perfex also gather counts for kernel code that handles exceptions, so the work done by the OS to
handle TLB misses is included in these counts.
3.3.3.3 Analytic Output with the -y Option
The raw event counts are interesting, but it is more useful to convert them to elapsed time. Some time estimates are simple,
for example, dividing the cycle count by the machine clock rate gives the elapsed run time (1645481936 / 195 MHz = 8.44
seconds). Other events are not as simple and can only be stated in terms of a range of times. For example, the time to handle
a primary cache miss varies depending on whether the needed data are in the secondary cache, in memory, or in the cache of
another CPU. Analysis of this kind can be requested using perfex -a -x -y.
When you use -a , -x, and -y, perfex collects and displays all event counts, but it also displays a report of estimated times
based on the counts. Here is an example, again, of the program adi2:
% perfex -a -x -y adi2
WARNING: Multiplexing events to project totals--inaccuracy possible.
Time: 7.996 seconds
Checksum: 5.6160428338E+06
Based on 196 MHz IP27
Typical Minimum Maximum
Event Counter Name Counter Value Time (sec) Time (sec) Time (sec)
===================================================================================================================
0 Cycles...................................................... 1639802080 8.366337 8.366337 8.366337
16 Cycles...................................................... 1639802080 8.366337 8.366337 8.366337
26 Secondary data cache misses................................. 7736432 2.920580 1.909429 3.248837
23 TLB misses.................................................. 5693808 1.978017 1.978017 1.978017
7 Quadwords written back from scache.......................... 61712384 1.973562 1.305834 1.973562
25 Primary data cache misses................................... 16368384 0.752445 0.235504 0.752445
22 Quadwords written back from primary data cache.............. 32385280 0.636139 0.518825 0.735278
2 Issued loads................................................ 109918560 0.560809 0.560809 0.560809
18 Graduated loads............................................. 88890736 0.453524 0.453524 0.453524
6 Decoded branches............................................ 52497360 0.267844 0.267844 0.267844
3 Issued stores............................................... 43923616 0.224100 0.224100 0.224100
19 Graduated stores............................................ 33430240 0.170562 0.170562 0.170562
21 Graduated floating point instructions....................... 28371152 0.144751 0.072375 7.527040
30 Store/prefetch exclusive to clean block in scache........... 7545984 0.038500 0.038500 0.038500
24 Mispredicted branches....................................... 417440 0.003024 0.001363 0.011118
9 Primary instruction cache misses............................ 8272 0.000761 0.000238 0.000761
10 Secondary instruction cache misses.......................... 768 0.000290 0.000190 0.000323
31 Store/prefetch exclusive to shared block in scache.......... 15168 0.000077 0.000077 0.000077
1 Issued instructions......................................... 673476960 0.000000 0.000000 3.436107
4 Issued store conditionals................................... 0 0.000000 0.000000 0.000000
5 Failed store conditionals................................... 0 0.000000 0.000000 0.000000
8 Correctable scache data array ECC errors.................... 0 0.000000 0.000000 0.000000
11 Instruction misprediction from scache way prediction table.. 432 0.000000 0.000000 0.000002
12 External interventions...................................... 6288 0.000000 0.000000 0.000000
13 External invalidations...................................... 9360 0.000000 0.000000 0.000000
14 Virtual coherency conditions................................ 0 0.000000 0.000000 0.000000
15 Graduated instructions...................................... 364303776 0.000000 0.000000 1.858693
17 Graduated instructions...................................... 392675440 0.000000 0.000000 2.003446
20 Graduated store conditionals................................ 0 0.000000 0.000000 0.000000
27 Data misprediction from scache way prediction table......... 679120 0.000000 0.000000 0.003465
28 External intervention hits in scache........................ 6288 0.000000 0.000000 0.000000
29 External invalidation hits in scache........................ 5952 0.000000 0.000000 0.000000
Statistics
=========================================================================================
Graduated instructions/cycle................................................ 0.222163
Graduated floating point instructions/cycle................................. 0.017302
Graduated loads & stores/cycle.............................................. 0.074595
Graduated loads & stores/floating point instruction......................... 5.422486
Mispredicted branches/Decoded branches...................................... 0.007952
Graduated loads/Issued loads................................................ 0.808696
Graduated stores/Issued stores.............................................. 0.761099
Data mispredict/Data scache hits............................................ 0.078675
Instruction mispredict/Instruction scache hits.............................. 0.057569
L1 Cache Line Reuse......................................................... 6.473003
L2 Cache Line Reuse......................................................... 1.115754
L1 Data Cache Hit Rate...................................................... 0.866185
L2 Data Cache Hit Rate...................................................... 0.527355
Time accessing memory/Total time............................................ 0.750045
L1--L2 bandwidth used (MB/s, average per process)........................... 124.541093
Memory bandwidth used (MB/s, average per process)........................... 236.383187
MFLOPS (average per process)................................................ 3.391108
"Maximum," "minimum," and "typical" time cost estimates are reported. Each is obtained by consulting an internal table
which holds the maximum, minimum, and typical costs for each event, and multiplying this cost by the count for the event.
Event costs are usually measured in terms of machine cycles, and so the cost of an event generally depends on the clock
speed of the processor, which is also reported in the output.
The maximum value in the table corresponds to the worst-case cost of a single occurrence of the event. Sometimes this can
be a pessimistic estimate. For example, the maximum cost for graduated floating point instructions assumes that every
floating point instruction is a double-precision reciprocal square root since it is the most costly R10000 floating point
instruction.
Because of the latency-hiding capabilities of the R10000, the minimum cost of virtually any event could be zero since most
events can be overlapped with other operations. To avoid simply reporting minimum costs of zero, which would be of no
practical use, the minimum time reported by perfex -y corresponds to the best-case cost of a single occurrence of the event.
The best-case cost is obtained by running the maximum number of simultaneous occurrences of that event and averaging the
cost. For example, two floating point instructions can complete per cycle, so the best case cost is 0.5 cycles per floating
point instruction.
The typical cost falls somewhere between minimum and maximum and is meant to correspond to the cost you see in average
programs.
perfex -y prints the event counts and associated cost estimates sorted from most costly to least costly. Although resembling
a profiling output, this is not a true profile. The event costs reported are only estimates. Furthermore, since events do
overlap with one another, the sum of the estimated times will usually exceed the program's run time. This output should
only be used to identify which events are responsible for significant portions of the program's run time and to get a rough
idea of what those costs might be.
In the example above, the program spends a significant fraction of its time handling secondary cache and TLB misses. To
make a significant improvement in the run time of this program, the tuning measures need to concentrate on reducing those
cache misses.
In addition to the event counts and cost estimates, perfex -y also reports a number of statistics derived from the typical
costs. The meaning of many of the statistics is self-evident, for example, Graduated instructions/cycle. Below is a
list of those statistics whose definitions require more explanation:
Data mispredict/Data scache hits
The ratio of the counts for data misprediction from scache way prediction table and secondary data cache misses.
Instruction mispredict/Instruction scache hits
The ratio of the counts for instruction misprediction from scache way prediction table and secondary instruction cache
misses.
L1 Cache Line Reuse
The number of times, on average, that a primary data cache line is used after it has been moved into the cache. It is
calculated as graduated loads plus graduated stores minus primary data cache misses, divided by primary data cache
misses.
L2 Cache Line Reuse
The number of times, on average, that a secondary data cache line is used after it has been moved into the cache. It is
calculated as primary data cache misses minus secondary data cache misses, divided by secondary data cache misses.
L1 Data Cache Hit Rate
The fraction of data accesses that are satisfied from a cache line already resident in the primary data cache. It is
calculated as 1.0 - (primary data cache misses divided by the sum of graduated loads and graduated stores).
L2 Data Cache Hit Rate
The fraction of data accesses that are satisfied from a cache line already resident in the secondary data cache. It is
calculated as 1.0 - (secondary data cache misses divided by
primary data cache misses).
Time accessing memory/Total time
The sum of the typical costs of graduated loads, graduated stores, primary data cache misses, secondary data cache
misses, and TLB misses, divided by the total program run time. The total program run time is calculated by
multiplying cycles by the time per cycle (inverse of the processor's clock speed).
L1--L2 bandwidth used (MB/s, average per process)
The amount of data moved between the primary and secondary data caches, divided by the total program run time. The
amount of data moved is calculated as the sum of the number of primary data cache misses multiplied by the primary
cache line size and the number of quadwords written back from primary data cache multiplied by the size of a
quadword (16 bytes). For multiprocessor programs, the resulting figure is a per-process average since the
counts measured by perfex are aggregates of the counts for all the threads. Multiply by the number of threads to get
the total program bandwidth.
Memory bandwidth used (MB/s, average per process)
The amount of data moved between the secondary data cache and main memory, divided by the total program run
time. The amount of data moved is calculated as the sum of the number of secondary data cache misses multiplied by
the secondary cache line size and the number of quadwords written back from secondary data cache multiplied by the
size of a quadword (16 bytes). For multiprocessor programs, the resulting figure is a per-process average since the
counts measured by perfex are aggregates of the counts for all the threads. Multiply by the number of threads to get
the total program bandwidth.
MFLOPS (MB/s, average per process)
The ratio of the graduated floating point instructions and the total program run time. Note that although a
multiply-add carries out two floating point operations, it only counts as one
instruction, so this statistic may underestimate the number of floating point operations per second. For multiprocessor
programs, the resulting figure is a per-process average since the counts measured by perfex are aggregates of the
counts for all the threads. Multiply by the number of threads to get the total program rate.
These statistics give you a quick way to identify performance problems in your program. For example, the cache hit-rate
statistics tell you how cache friendly your program is. Since a secondary cache miss is much more expensive than a cache
hit, the L2 Data Cache Hit Rate needs to be close to 1.0 to indicate that the program is not paying a large penalty for
the cache misses. Values of ~0.96 and above indicate good cache performance. Note that, for the above example, the rate is
0.53, further confirmation of the cache problems in this program.
3.3.4 Using SpeedShop
The purpose of profiling is to find out exactly where a program is spending its time, that is, in precisely which procedures or
lines of code. Then you can concentrate your efforts on the (usually small) areas of code where there is the most to be
gained.
Profiling using the SpeedShop package supports these methods:
Sampling, in which the subject program is frequently interrupted; the program counter (PC) and stack are recorded on
each interruption. The more frequently the PC is found in a particular procedure, the more execution time that
procedure costs. SpeedShop can sample on a variety of time bases: the system timer or any of the R10000 performance
counters.
Ideal counting, in which a copy of the subject program binary is modified with trap instructions at the end of each
basic block. During execution, the exact number of uses of each basic block is counted.
Exception trace, not really a profiling method, records only floating point exceptions and their locations.
The SpeedShop package has three parts:
ssrun performs experiments and collects data .
prof processes data and prepares reports.
The ssapi interface allows you to insert caliper points into a program to profile specific sections of code or phases of
execution.
These programs are documented in the following reference pages:
speedshop documents the types of experiments, as well as a number of environment variables you can set.
ssrun documents specific options of ssrun. You need both speedshop(1) and ssrun(1) to run an experiment.
prof documents the report types and the options you use to get them.
ssapi documents the three library calls you can use.
3.3.4.1 PC Sampling Profiling
The accuracy of sampling depends on the time base that sets the sampling interval. In each case, the time base is the
independent variable and the program state is the dependent variable. Select from the sampling methods summarized in this
table:
ssrun
Option
Time Base
Comments
-usertime
30 ms timer
Fairly coarse resolution; experiment runs quickly and output file is small; some bugs noted
in speedshop(1).
-pcsamp[x]
-fpcsamp[x]
10 ms timer
1 ms timer
Moderately coarse resolution; functions that cause cache misses or page faults are
emphasized. Suffix x for 32-bit counts.
-gi_hwc
-fgi_hwc
32771 insts
6553 insts
Fine-grain resolution based on graduated instructions. Emphasizes functions that burn a lot
of instructions.
-cy_hwc
-fcy_hwc
16411 clocks
3779 clocks
Fine-grain resolution based on elapsed cycles. Emphasizes functions with cache misses and
mispredicted branches.
-ic_hwc
-fic_hwc
2053 icache miss
419 icache miss
Emphasizes code that doesn't fit in L1 cache.
-isc_hwc
-fisc_hwc
131 scache miss
29 scache miss
Emphasizes code that doesn't fit in L2 cache. Should be coarse-grained measure.
-dc_hwc
-fdc_hwc
2053 dcache miss
419 dcache miss
Emphasizes code that causes L1 cache data misses.
-dsc_hwc
-fdsc_hwc
131 scache miss
29 scache miss
Emphasizes code that causes L2 cache data misses.
-tlb_hwc
-ftlb_hwc
257 TLB miss
53 TLB miss
Emphasizes code that causes page faults.
-gfp_hwc
-fgfp_hwc
32771 fp insts
6553 fp insts
Emphasizes code that performs heavy FP calculation.
-prof_hwc
user-set
Hardware counter and overflow value from environment variables.
Each time base finds the program PC more often in the code that consumes the most units of that time base:
The time bases that reflect actual elapsed time (-usertime, -pcsamp, -cy_hwc) find the PC more often in the code
where the program spends elapsed time. The time may be spent in that code because it is executed a lot, or it might be
spent there because those instructions are processed slowly owing to cache misses, memory contention, or failed
branch prediction. Use these to get an overview of the program and to find major trouble spots.
The time bases that reflect instruction counts (-gi_hwc, -gfp_hwc) find the PC more often in the code that actually
performs the most instructions. Use these to find the code that could benefit most from algorithmic changes.
The time bases that reflect data access (-dc_hwc, -sc_hwc, -tlb_hwc) find the PC more often in the code that has to
wait for its data to be brought in from another memory level. Use these to find memory access problems.
The time bases that reflect code access (-ic_hwc, -isc_hwc) find the PC more often in the code that has to be fetched
from memory when it is called. Use these to pinpoint functions that could be reorganized for better locality, or to see
when automatic inlining has gone too far.
It is easy to perform an experiment. Here is the application of an experiment to program adi2:
% ssrun -fpcsamp adi2
Time: 7.619 seconds
Checksum: 5.6160428338E+06
The output file of samples is left in a file with the default name of ./command.experiment.pid. In this case the name was
adi2.fpcsamp.4885. It is often more convenient, however, to dictate the name of the output file. You can do this by putting
the desired filename and directory in environment variables. Using this csh script you can run an experiment, passing the
output directory and filename on the command line, for example
% ssruno -d /var/tmp -o adi2.cy -cy_hwc adi2
ssrun -cy_hwc adi2 ..................................
Time: 9.644 seconds
Checksum: 5.6160428338E+06
.................................. ssrun ends.
-rw-r--r-- 1 guest guest 18480 Dec 17 16:25 /var/tmp/adi2.cy
3.3.4.2 Using Prof
Regardless of which time base you use for sampling, you display the result using prof. By default, prof displays a list of
procedures ordered from the one with the most samples to the least:
% prof adi2.fpcsamp.4885
-------------------------------------------------------------------------------
Profile listing generated Sat Jan 4 10:28:11 1997
with: prof adi2.fpcsamp.4885
-------------------------------------------------------------------------------
samples time CPU FPU Clock N-cpu S-interval Countsize
8574 8.6s R10000 R10010 196.0MHz 1 1.0ms 2(bytes)
Each sample covers 4 bytes for every 1.0ms ( 0.01% of 8.5740s)
-------------------------------------------------------------------------------
-p[rocedures] using pc-sampling.
Sorted in descending order by the number of samples in each procedure.
Unexecuted procedures are excluded.
-------------------------------------------------------------------------------
samples time(%) cum time(%) procedure (dso:file)
6688 6.7s( 78.0) 6.7s( 78.0) zsweep (adi2:adi2.f)
671 0.67s( 7.8) 7.4s( 85.8) xsweep (adi2:adi2.f)
662 0.66s( 7.7) 8s( 93.6) ysweep (adi2:adi2.f)
208 0.21s( 2.4) 8.2s( 96.0) fake_adi (adi2:adi2.f)
178 0.18s( 2.1) 8.4s( 98.1) irand_ (/usr/lib32/libftn.so:../../libF77/rand_.c)
166 0.17s( 1.9) 8.6s(100.0) rand_ (/usr/lib32/libftn.so:../../libF77/rand_.c)
1 0.001s( 0.0) 8.6s(100.0) __oserror (/usr/lib32/libc.so.1:oserror.c)
8574 8.6s(100.0) 8.6s(100.0) TOTAL
This profile indicates that you should focus on the routine zsweep, since it consumes almost 80% of the run time of this
program.
For finer detail, use the -heavy option. This supplements the list of procedures with a list of individual source line numbers,
ordered by frequency:
-------------------------------------------------------------------------------
-h[eavy] using pc-sampling.
Sorted in descending order by the number of samples in each line.
Lines with no samples are excluded.
-------------------------------------------------------------------------------
samples time(%) cum time(%) procedure (file:line)
3405 3.4s( 39.7) 3.4s( 39.7) zsweep (adi2.f:122)
3226 3.2s( 37.6) 6.6s( 77.3) zsweep (adi2.f:126)
425 0.42s( 5.0) 7.1s( 82.3) xsweep (adi2.f:80)
387 0.39s( 4.5) 7.4s( 86.8) ysweep (adi2.f:101)
273 0.27s( 3.2) 7.7s( 90.0) ysweep (adi2.f:105)
246 0.25s( 2.9) 8s( 92.9) xsweep (adi2.f:84)
167 0.17s( 1.9) 8.1s( 94.8) irand_ (../../libF77/rand_.c:62)
163 0.16s( 1.9) 8.3s( 96.7) fake_adi (adi2.f:18)
160 0.16s( 1.9) 8.5s( 98.6) rand_ (../../libF77/rand_.c:69)
45 0.045s( 0.5) 8.5s( 99.1) fake_adi (adi2.f:59)
32 0.032s( 0.4) 8.5s( 99.5) zsweep (adi2.f:113)
21 0.021s( 0.2) 8.5s( 99.7) zsweep (adi2.f:121)
11 0.011s( 0.1) 8.6s( 99.8) irand_ (../../libF77/rand_.c:63)
6 0.006s( 0.1) 8.6s( 99.9) rand_ (../../libF77/rand_.c:67)
4 0.004s( 0.0) 8.6s(100.0) zsweep (adi2.f:125)
1 0.001s( 0.0) 8.6s(100.0) ysweep (adi2.f:104)
1 0.001s( 0.0) 8.6s(100.0) ysweep (adi2.f:100)
1 0.001s( 0.0) 8.6s(100.0) __oserror (oserror.c:127)
8574 8.6s(100.0) 8.6s(100.0) TOTAL
From this listing it is clear that lines 122 and 126 warrant further inspection. Even finer detail can be obtained with the
-source option, which lists the source code and disassembled machine code, indicating sample hits on specific instructions.
3.3.4.3 Ideal Time Profiling
The other type of profiling is called ideal time, or basic block, profiling. Basic block is a compiler term for a section of code
that has only one entrance and one exit. Any program can be decomposed into basic blocks.
To obtain an ideal profile, ssrun copies the executable program and modifies the copy to contain code that records the entry
to each basic block. Not only the executable itself but also all dynamic shared objects (DSOs; for more information, see
dso(5)) that it links to are copied and instrumented. The instrumented executable and libraries are statically linked and run:
% ssrun -ideal adi2
Beginning libraries
/usr/lib32/libssrt.so
/usr/lib32/libss.so
/usr/lib32/libfastm.so
/usr/lib32/libftn.so
/usr/lib32/libm.so
/usr/lib32/libc.so.1
Ending libraries, beginning "adi2"
Time: 8.291 seconds
Checksum: 5.6160428338E+06
The number of times each basic block was encountered is recorded. The output data file is displayed using prof, just as for a
sampled run. The report ranks source and library procedures from most to least used:
% prof adi2.ideal.4920
Prof run at: Sat Jan 4 10:34:06 1997
Command line: prof adi2.ideal.4920
285898739: Total number of cycles
1.45867s: Total execution time
285898739: Total number of instructions executed
1.000: Ratio of cycles / instruction
196: Clock rate in MHz
R10000: Target processor modeled
---------------------------------------------------------
Procedures sorted in descending order of cycles executed.
Unexecuted procedures are not listed. Procedures
beginning with *DF* are dummy functions and represent
init, fini and stub sections.
---------------------------------------------------------
cycles(%) cum % secs instrns calls procedure(dso:file)
68026368(23.79) 23.79 0.35 68026368 32768 xsweep(adi2.pixie:adi2.f)
68026368(23.79) 47.59 0.35 68026368 32768 ysweep(adi2.pixie:adi2.f)
68026368(23.79) 71.38 0.35 68026368 32768 zsweep(adi2.pixie:adi2.f)
35651584(12.47) 83.85 0.18 35651584 2097152 rand_(./libftn.so.pixn32:../../libF77/rand_.c)
27262976( 9.54) 93.39 0.14 27262976 2097152 irand_(./libftn.so.pixn32:../../libF77/rand_.c)
18874113( 6.60) 99.99 0.10 18874113 1 fake_adi(adi2.pixie:adi2.f)
11508( 0.00) 99.99 0.00 11508 5 memset(./libc.so.1.pixn32:/slayer_xlv0/ficussg-nov05/work/irix/lib/libc/libc_n32_M4/strings/bzero.s)
3101( 0.00) 99.99 0.00 3101 55 __flsbuf(./libc.so.1.pixn32:_flsbuf.c)
2446( 0.00) 100.00 0.00 2446 42 x_putc(./libftn.so.pixn32:../../libI77/wsfe.c)
1234( 0.00) 100.00 0.00 1234 2 x_wEND(./libftn.so.pixn32:../../libI77/wsfe.c)
1047( 0.00) 100.00 0.00 1047 1 f_exit(./libftn.so.pixn32:../../libI77/close.c)
1005( 0.00) 100.00 0.00 1005 5 fflush(./libc.so.1.pixn32:flush.c)
639( 0.00) 100.00 0.00 639 4 do_fio64_mp(./libftn.so.pixn32:../../libI77/fmt.c)
566( 0.00) 100.00 0.00 566 3 wrt_AP(./libftn.so.pixn32:../../libI77/wrtfmt.c)
495( 0.00) 100.00 0.00 495 6 map_luno(./libftn.so.pixn32:../../libI77/util.c)
458( 0.00) 100.00 0.00 458 14 op_gen(./libftn.so.pixn32:../../libI77/fmt.c)
440( 0.00) 100.00 0.00 440 9 gt_num(./libftn.so.pixn32:../../libI77/fmt.c)
414( 0.00) 100.00 0.00 414 1 getenv(./libc.so.1.pixn32:getenv.c)
.
.
.
The -heavy option adds a list of source lines, sorted by their consumption of ideal instruction cycles.
An ideal profile shows exactly and repeatedly which statements are most executed and gives you an exact view of the
algorithmic complexity of the program. An ideal profile does not necessarily reflect where a program spends its time since
it cannot take cache and TLB misses into account. Consequently, the results of the ideal profile are startlingly different from
that of the PC sampling profile. These ideal results indicate that zsweep should take exactly the same amount of run time as
ysweep and xsweep. These differences can be used to infer where cache performance issues exist. On machines without the
R10000's hardware profiling registers, such comparisons are the only profiling method available to locate cache problems.
3.3.4.3.1 Operation Counts
Since ideal profiling counts the instructions executed by the program, it can provide all sorts of interesting information
about the program. Already printed in the standard prof output are counts of how many times each subroutine is called. In
addition, you may use the -op option to prof to get a listing detailing the counts of all instructions in the program. In
particular, this will provide an exact count of the floating point operations executed:
% prof -op adi2.ideal.4920
Prof run at: Wed Jan 15 14:42:54 1997
Command line: prof -op adi2.ideal.4920
285898739: Total number of cycles
1.45867s: Total execution time
285898739: Total number of instructions executed
1.000: Ratio of cycles / instruction
196: Clock rate in MHz
R10000: Target processor modeled
---------------------------------------------------------
pixstats summary
---------------------------------------------------------
56590456: Floating point operations (38.796 Mflops @ 196 MHz)
105500230: Integer operations (72.3265 M intops @ 196 MHz)
.
.
.
Note that this is different from what you get using perfex. The R10000 counter #21 counts floating point instructions, not
floating point operations. As a result, in a program that executes a lot of multiply-add instructions --- each of which
carries out two floating point operations --- perfex's MFLOPS statistic can be off by a factor of two. Since prof -op
records all instructions executed, it counts each multiply-add instruction as two floating point operations, thus providing the
correct tally. The Mlops figure it calculates, however, is based on the ideal time; to calculate floating point performance,
divide the number of floating point operations counted by prof -op by wall clock time.
Either method of profiling, PC sampling or ideal, can be applied to multiprocessor runs just as easily as it is applied to
single-processor runs; each thread of an application maintains its own histogram, and the histograms may be printed
individually or merged in any combination and printed as one profile.
3.3.4.4 Gprof
One limitation of the prof output for either PC sampling or ideal time is that the information reported contains no
information about the call hierarchy. That is, if the routine zsweep in the above example were called from two different
locations in the program, you would not know how much time results from the call at each location; you would only know
the total time spent in zsweep. If you knew that, say, the first location was responsible for the majority of the time, this
could affect how you tune the program. For example, you might try inlining the call into the first location, but not bother
with the second. Or, if you wanted to parallelize the program, knowing that the first location is where the majority of the
time is spent, you might consider parallelizing the calls to zsweep there rather than trying to parallelize the zsweep
routine itself.
SpeedShop provides two methods of obtaining hierarchical profiling information. The first method, which is called gprof, is
used in conjunction with the ideal time profile. To obtain the gprof information for the above example, simply add the flag
-gprof to the prof command:
% prof -gprof adi2.ideal.4920
Prof run at: Wed Jan 15 16:52:09 1997
Command line: prof -gprof adi2.ideal.4920
285898739: Total number of cycles
1.45867s: Total execution time
285898739: Total number of instructions executed
1.000: Ratio of cycles / instruction
196: Clock rate in MHz
R10000: Target processor modeled
---------------------------------------------------------
Procedures sorted in descending order of cycles executed.
Unexecuted procedures are not listed. Procedures
beginning with *DF* are dummy functions and represent
init, fini and stub sections.
---------------------------------------------------------
cycles(%) cum % secs instrns calls procedure(dso:file)
68026368(23.79) 23.79 0.35 68026368 32768 xsweep(adi2.pixie:adi2.f)
68026368(23.79) 47.59 0.35 68026368 32768 ysweep(adi2.pixie:adi2.f)
68026368(23.79) 71.38 0.35 68026368 32768 zsweep(adi2.pixie:adi2.f)
35651584(12.47) 83.85 0.18 35651584 2097152 rand_(./libftn.so.pixn32:../../libF77/rand_.c)
27262976( 9.54) 93.39 0.14 27262976 2097152 irand_(./libftn.so.pixn32:../../libF77/rand_.c)
.
.
.
All times are in milliseconds.
--------------------------------------------------------------------------------
NOTE: any functions which are not part of the call
graph are listed at the end of the gprof listing
--------------------------------------------------------------------------------
self kids called/total parents
index cycles(%) self(%) kids(%) called+self name index
self kids called/total children
[1] 285895481(100.00%) 57( 0.00%) 285895424(100.00%) 0 __start [1]
50 285895369 1/1 main [2]
3 0 1/1 __istart [112]
2 0 1/1 __readenv_sigfpe [113]
--------------------------------------------------------------------------------
50 285895369 1/1 __start [1]
[2] 285895419(100.00%) 50( 0.00%) 285895369(100.00%) 1 main [2]
18874113 267020445 1/1 fake_adi [3]
205 606 5/5 signal [44]
--------------------------------------------------------------------------------
18874113 267020445 1/1 main [2]
[3] 285894558(100.00%) 18874113( 6.60%) 267020445(93.40%) 1 fake_adi [3]
68026368 0 32768/32768 zsweep [4]
68026368 0 32768/32768 ysweep [5]
68026368 0 32768/32768 xsweep [6]
35651584 27262976 2097152/2097152 rand_ [7]
28 13486 2/2 s_wsfe64 [9]
26 5368 2/2 e_wsfe [17]
22 2610 1/1 do_fioxr4v [25]
22 2610 1/1 do_fioxr8v [24]
23 2428 1/1 s_stop [28]
114 44 2/2 dtime_ [68]
--------------------------------------------------------------------------------
68026368 0 32768/32768 fake_adi [3]
[4] 68026368(23.79%) 68026368(100.00%) 0( 0.00%) 32768 zsweep [4]
--------------------------------------------------------------------------------
68026368 0 32768/32768 fake_adi [3]
[5] 68026368(23.79%) 68026368(100.00%) 0( 0.00%) 32768 ysweep [5]
--------------------------------------------------------------------------------
68026368 0 32768/32768 fake_adi [3]
[6] 68026368(23.79%) 68026368(100.00%) 0( 0.00%) 32768 xsweep [6]
--------------------------------------------------------------------------------
.
.
.
This produces the usual ideal time profiling output, but following that is the hierarchical information. There is a block of
information for each subroutine in the program. A number, shown in brackets (e.g., [1]), is assigned to each routine so that
the information pertaining to it can easily be located in the output. Let's look in detail at the block of information provided;
we'll use fake_adi [3] as an example.
The line beginning with the number [3] shows, from left to right, the:
Number of cycles consumed by this routine and the routines it calls (its descendants)
Number of cycles spent inside the routine, but not in any of its descendants
Number of cycles spent in its descendants
Total number of times the routine was called
Name of the routine, fake_adi [3]
Above this line are lines showing which routines fake_adi [3] was called from. In this case, it is only called from one
place, main [2], so there is just one line (in general, there would be one line for each routine which calls fake_adi
[3]). This line shows
The proportion of the cycles spent inside fake_adi [3] as a result of the call from main [2]
The proportion of time spent in fake_adi [3]'s descendants as a result of the call from main [2]
How many calls there are to fake_adi [3] in main [2] / the total number of calls to fake_adi [3] from all
places in the program
Since fake_adi [3] is only called once, all the time in it and its descendants is da result of this one call.
Below the line beginning with the number [3] are all the descendants of fake_adi [3]. For each descendant you see:
The proportion of the descendant's cycles spent inside it
The proportion of the descendant's cycles spent in its descendants (i.e., fake_adi [3]'s grandchildren)
How many calls to the descendant there are in fake_adi [3] / the total number of calls to the descendant from all
places in the program
This block of information allows you to determine not just which subroutines but which paths in the program are
responsible for the majority of time. The only limitation is that gprof reports ideal time, so cache misses are not represented.
3.3.4.5 Usertime Profiling
gprof only reports ideal time. To get hierarchical profiling information that accurately accounts for all the time in the
program, the way PC sampling does, use usertime profiling. For this type of profiling, the program is sampled during the
run. At each sample, the location of the program counter is noted and the entire call stack is traced to record which routines
have been called to get to this point in the program. From this, a hierarchical profile is constructed. Since unwinding the call
stack is an expensive operation, the sampling period is for usertime profiling is relatively long: 30 milliseconds.
usertime profiling is performed with the following command:
% ssrun -usertime adi2
The output is written to a file called ./adi2.usertime.pid, where pid is the process ID for this run of adi2. The profile is
displayed using prof just as for PC sampling and ideal time profiling:
% prof adi2.usertime.19572
The output is as follows:
-------------------------------------------------------------------------------
Profile listing generated Wed Jan 15 16:57:10 1997
with: prof adi2.usertime.19572
-------------------------------------------------------------------------------
Total Time (secs) : 9.99
Total Samples : 333
Stack backtrace failed: 0
Sample interval (ms) : 30
CPU : R10000
FPU : R10010
Clock : 196.0MHz
Number of CPUs : 1
-------------------------------------------------------------------------------
index %Samples self descendents total name
[1] 100.0% 0.00 9.99 333 __start
[2] 100.0% 0.00 9.99 333 main
[3] 100.0% 0.09 9.90 333 fake_adi
[4] 80.8% 8.07 0.00 269 zsweep
[5] 7.5% 0.75 0.00 25 xsweep
[6] 6.9% 0.69 0.00 23 ysweep
[7] 3.9% 0.12 0.27 13 rand_
[8] 2.7% 0.27 0.00 9 irand_
The information is less detailed than that provided by gprof, but when combined with gprof, you can get a complete
hierarchical profile for all routines which have run long enough to be sampled.
Multiprocessor Programming
4.1 Introduction
The selling point of a multiprocessor is that you can apply the power of multiple CPUs to your program, so that it completes
in less time. When we start running a program concurrently on multiple CPUs, we discover:
There is a mathematical limit on how fast a real program can go.
A variety of programming models can be used to express parallelism.
Parallel processing introduces new bottlenecks, but there are additional
tools to deal with them.
You can write a program that expresses parallel execution in the source code, or you can let the Power Fortran or Power C
compiler find the parallelism that is implicit in a conventional, serial program.
4.2 Parallel Speedup and Amdahl's Law
When your program runs on more than one CPU, its total run time should be less. But how much less? And what are the
limits on the speedup?
4.2.1 Adding CPUs to Shorten Execution Time
You can make your program run faster by distributing the work it does over multiple CPUs. There is always some part of
the program's logic that has to be executed serially, by a single CPU. However, suppose there is one loop where the program
spends 50% of the execution time. If you can divide the iterations of this loop so that half of them are done in one CPU
while the other half are done at the same time in a different CPU, the whole loop can be finished in half the time. The result:
a 25% reduction in program execution time.
The following topics cover the formal mathematical treatment of these ideas (Amdahl's law). There are two basic limits to
the speedup you can achieve by parallel execution:
The fraction of the program that can be run in parallel, p, is never 100%.
Because of hardware constraints, after a certain point, there is less and
less benefit from each added CPU.
Tuning for parallel execution comes down to battling these two limits. You strive to increase the parallel fraction, p,
because in some cases even a small change in p (from 0.8 to 0.85, for example) makes a dramatic
change in the effectiveness of added CPUs.
And you work to ensure that each added CPU does a full CPU's work and does not interfere with the work of other CPUs. In
the Origin architectures this means
Evenly spreading the workload among the CPUs
Eliminating false sharing and other types of memory contention between CPUs
Making sure that the data used by each CPU are located in a memory near that CPU's node
There are two ways to obtain the use of multiple CPUs. You can write your source code to use explicit parallelism ---
showing in the source code which parts of the program are to execute asynchronously and how the parts are to coordinate
with each other.
Alternatively you can take a conventional program in C, C++, or Fortran, and apply a precompiler to it to find the
parallelism that is implicit in the code. These precompilers, IRIS Power C and Power Fortran, insert the source directives
for parallel execution for you.
4.2.2 Parallel Speedup
If half the iterations of a DO-loop are performed on one CPU, and the other half run at the same time on a second CPU, the
whole DO-loop should complete in half the time. For example,
for (j=0; j==MAX; ++j) { z[j] = a[j]*b[j]; }
The IRIS Power C package can automatically distribute such a loop over n CPUs (with n decided at run time based on the
available hardware), so that each CPU performs MAX/n iterations.
The speedup gained from applying n CPUs, Speedup(n), is the ratio of the one-CPU execution time to the n-CPU parallel
execution time:
Speedup(n) = T(1)/T(n).
If you measure the one-CPU execution time of a program at 100 seconds, and the
program runs in 60 seconds with 2 CPUs, Speedup(2) = 100/60 = 1.67.
This number captures the improvement from adding hardware to the system. We expect T(n) to be less than T(1); if it is not,
adding CPUs has made the program slower, and something is wrong! So Speedup(n) should be a number greater than 1.0,
and the greater it is, the more pleased we are. Intuitively you might hope that the speedup would be equal to the number of
CPUs --- twice as many CPUs, half the time --- but this ideal can never be achieved (well, almost never).
Normally, the number Speedup(n) must be less than n, reflecting the fact that not all parts of a program benefit from
parallel execution. However, it is possible --- in rare situations --- for Speedup(n) to be larger than n. This is called a
isuperlinear speedup --- the program has been sped up by more than the increase of CPUs.
A superlinear speedup does not really result from parallel execution. It comes about because each CPU is now working on a
smaller set of memory. The problem data handled by any one CPU fits better in cache, so each CPU executes faster than the
single CPU could do. A superlinear speedup is welcome, but it indicates that the sequential program was being held back by
cache effects.
4.2.3 Amdahl's Law
There are always parts of a program that you cannot make parallel --- code that must run serially. For example, consider
the DO-loop. Some amount of code is devoted to setting up the loop, allocating the work between CPUs. Then comes the
parallel part, with all CPUs running concurrently. At the end of the loop is more housekeeping that must be done serially;
for example, if n does not divide MAX evenly, one CPU must execute the few iterations that are left over. The serial parts
of the program cannot be speeded up by concurrency.
The mathematical statement of this idea is called Amdahl's law. Let p be the fraction of the program's code that can be
made parallel (p is always a fraction less than 1.0.) The remaining (1-p) of the code must run serially. In practical cases, p
ranges from 0.2 to 0.99.
The potential speedup for a program is proportional to p divided by the CPUs you can apply, plus the remaining serial part,
(1-p):
1
Speedup(n) = ----------- (Amdahl's law: Speedup(n) given p)
(p/n)+(1-p)
Suppose p = 0.8; then Speedup(2 )= 1/(0.4+0.2) = 1.67, and Speedup(4) = 1/(0.2+0.2) = 2.5.
The maximum possible speedup
--- if you could apply an infinite number of CPUs --- would be 1/(1-p).
In particular, the more CPUs you have, the more benefit you get from increasing p. Using only 4 CPUs, you need only p =
0.6 to get half the ideal speedup. With 8 CPUs, you need p = 0.85 to get half the ideal speedup.
4.2.4 Calculating the Parallel Fraction of a Program
You do not have to guess at the value of p for a given program. Measure the execution times T(1) and T(2) to calculate a
measured Speedup(2) = T(1)/T(2). The Amdahl's law equation can be rearranged to yield p when Speedup(n) is known:
2 SpeedUp(2) - 1
p = --- * -------------- (Amdahl's law: p given Speedup(2))
1 SpeedUp(2)
Suppose you measure T(1) = 188 seconds and T(2) = 104 seconds.
SpeedUp(2) = 188/104 = 1.81
p = 2 * ((1.81-1)/1.81) = 2*(0.81/1.81) = 0.895
4.2.5 Predicting Execution Time with n CPUs
In some cases, the Speedup(2) = T(1)/T(2) is a value greater than 2, in other words, a superlinear speedup (described earlier).
When this occurs, the formula in the preceding section returns a value of p greater than 1.0, clearly not useful. In this case
you need to calculate p from two other more realistic timings, for example T(2) and T(3). The general formula for p is:
Speedup(n) - Speedup(m)
p = -------------------------------------------
(1 - 1/n)*Speedup(n) - (1 - 1/m)*Speedup(m)
where n and m are the two processor counts whose speedups are known. You can use this calculated value of p to extrapolate
the potential speedup with higher numbers of processors. For example, what is the expected time with 4 CPUs?
Speedup(4)= 1/((0.895/4)+(1-0.895)) = 3.04
T(4)= T(1)/Speedup(4) = 188/3.04 = 61.8
4.5.5 Cache Friendly Programs
Data placement issues won't arise for all parallel programs: those which are cache friendly do not incur performance
penalties if the data placement is not optimal. This is because such programs satisfy their memory requests primarily from
cache, rather than main memory, and the time is the same for each processor to access its caches. Thus, if some memory
requests take a bit longer than ideal, the net effect on the program will be negligible. So data placement can only be an issue
for parallel programs which are very memory intensive and are not cache friendly.
You can determine how cache friendly a program is using perfex(1). The command
% perfex -a -x -y a.out
will tell you how many primary and secondary cache misses the program a.out generates, what the cache hit ratios are, and
it will estimate how much the cache misses cost. If they only account for a small percentage of the run time, then the
program makes good use of the cache and its performance will not be affected by data placement.
On the other hand, if the time spent in cache misses is high, then the program is not cache friendly and data placement could
affect performance. But there are other issues common to all cache-based shared memory platforms which are more likely
to be the source of performance problems, and we consider these in the next section.
4.5.6 Non Cache Friendly Programs
If a properly parallelized MP library program does not scale as well as expected, there are several potential causes. The first
thing you need to check for is whether false sharing or other forms of cache coherency contention are a problem.
Cache coherency contention can arise in any cache-based multiprocessor system. It can only be an issue, however, for data
that are frequently updated or written. Data that are mostly read and rarely written do not cause cache coherency contention.
When one CPU modifies a cache line, any
other CPU that has a copy of the same cache line is notified, discards its copy, and fetches a new copy when it needs those
data again. This can cause performance problems in two cases:
1.
If one CPU repeatedly updates a cache line that other CPUs use for input, all the reading CPUs are forced to
frequently retrieve a new copy of that cache line from memory. This slows all the reading CPUs.
2.
If two or more CPUs repeatedly update the same cache line, they contend for the exclusive ownership of the cache
line. Each CPU has to get ownership and fetch a new copy of the cache line before it can perform its update. This
forces the updating CPUs to execute serially, as well as making all CPUs fetch a new copy of the cache line on every
use.
The first of these is generic cache coherency contention. The latter is referred to as false sharing. Fortunately, the tools
available for Origin can help you identify these problems. In both cases, perfex(1) will reveal a high number of cache
invalidation events (cf. hardware counters 31, 12, 13, 28, and 29 --- refer to the description of perfex for details of using
the hardware counters). In addition, the CPU(s) repeatedly updating the same cache line shows a high count of stores to
shared cache lines (counter 31). Some examples of how these problems occur and what you can do to fix them are now
presented.
4.5.6.1 False Sharing
False sharing is best demonstrated with an example. Consider the following code:
integer m, n, i, j
real a(m,n), s(m)
c$doacross local(i,j), shared(s,a)
do i = 1, m
s(i) = 0.0
do j = 1, n
s(i) = s(i) + a(i,j)
enddo
enddo
This code calculates the sums of the rows of a matrix. For simplicity, assume m = 4 and that the code is run on up to four
processors. What you observe is that the time for the parallel runs is longer than when just one processor is used. To
understand what causes this, let's look at what happens when this loop is run in parallel.
The following is a timeline of the operations which are carried out (more or less) simultaneously:
t = 0 t = 1 t = 2 ...
s(1) = 0.0 s(1) = s(1) + a(1,1) s(1) = s(1) + a(1,2) ...
s(2) = 0.0 s(2) = s(2) + a(2,1) s(2) = s(2) + a(2,2) ...
s(3) = 0.0 s(3) = s(3) + a(3,1) s(3) = s(3) + a(3,2) ...
s(4) = 0.0 s(4) = s(4) + a(4,1) s(4) = s(4) + a(4,2) ...
At each stage of the calculation, all four processors attempt to concurrently update one element of the sum array, s(m). For
a processor to update one element of s, it needs to gain exclusive access to the cache line holding the element it wishes to
update. But since s is only four words in size, it is likely that s is contained entirely in a single cache line, so each processor
needs exclusive access to all of s. Thus, only one processor at a time can update an element of s. Instead of operating in
parallel, the calculation is serialized.
Actually, it's a bit worse than just serialized. For a processor to gain exclusive access to a cache line, it first needs to
invalidate any cached copies of s which may reside in the other processors. Then it needs to read a fresh copy of the cache
line from memory since the invalidations will have caused data in some other processor's cache to be written back to main
memory. So, whereas in a sequential version, the element of s being updated can be kept in a register, in the parallel version,
false sharing forces the value to continually be reread from memory, in addition to serializing the updates.
This serialization is purely a result of the unfortunate accident that the different elements of s ended up in the same cache
line. If each element were in a separate cache line, each processor could keep a copy of the appropriate line in its cache, and
the calculations could be done perfectly in parallel. This, then, shows how to fix the problem. All you need do is spread out
the elements of s so that each is in its own cache line. Here is a simple way to do this:
integer m, n, i, j
real a(m,n), s(32,m)
c$doacross local(i,j), shared(s,a)
do i = 1, m
s(1,i) = 0.0
do j = 1, n
s(1,i) = s(1,i) + a(i,j)
enddo
enddo
Convert s to a two-dimensional array with the first dimension one (secondary) cache line in size.
Tuning Highlights |
PECM |
SCD |
Tips