Fast Hints for Performance Estimates on SGI systems
The most common metric for reporting the relative performance of
application codes across machines is MFLOPS (millions of floating point
operations per second). Assuming a code executes a similar mix of
arithmetic operations (ignoring any differences in precision, rounding,
etc.) to solve identical problems on different machines, MFLOPS
characterizes the performance of an application independently of its
underlying algorithm. This allows different applications to be compared
on the same computer, or different computers to be compared running the
same application.
Unfortunately, very few machines have direct hardware interfaces for
accurately measuring the MFLOP rate of an executed program. Cray
machines with HPM (Hardware Performance Monitor) provide this capability.
SGI machines based on the R10000 have hardware counters, but require an
indirect process to obtain an accurate MFLOPS figure for an executable.
If your code already runs on a Cray machine with HPM, you can derive the
MFLOP rate for any other machine (e.g., an SGI) by multiplying the Cray
HPM MFLOP rate by the ratio of the Cray CPU time to the CPU time on the
other machine. For example, if your code runs to completion in 100
seconds (single processor) on a Cray C90, and HPM tells you it performed
at 250 MFLOPS, and it then runs in 200 seconds (single processor) on
another computer, you can estimate that it runs at 250 * 100 / 200 = 125
MFLOPS on that machine.
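The scaling rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name scale_mflops and its parameter names are made up for this example, and the numbers are the ones from the C90 example in the text.

```python
def scale_mflops(ref_mflops, ref_cpu_seconds, target_cpu_seconds):
    """Estimate MFLOPS on a target machine from a reference measurement.

    Assumes both runs solve the same problem and execute the same
    number of floating point operations.
    """
    if ref_cpu_seconds <= 0 or target_cpu_seconds <= 0:
        raise ValueError("CPU times must be positive")
    # Same FLOP count, so the rate scales inversely with CPU time.
    return ref_mflops * ref_cpu_seconds / target_cpu_seconds


# Cray C90: 250 MFLOPS in 100 s; other machine: 200 s.
estimate = scale_mflops(250.0, 100.0, 200.0)
print(estimate)  # 125.0
```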
You can get a quick estimate of the MFLOP performance of your code on the
SGI using the perfex command. In the worst case, however, the number
reported by "perfex -e 21" (to count MFLOPS) can be off by a factor of
two, because perfex counts the MIPS multiply-add operation as one
floating point operation, even though it really accomplishes two. Of
course, the size of the error depends on how many of the floating point
operations in your code are actually multiply-adds.
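To see how large the undercount can be, the sketch below brackets the true operation count given a raw perfex count. It assumes the count is a number of floating point instructions in which each multiply-add was counted once; the function names, the multiply-add fraction, and the sample count are hypothetical values chosen for illustration, not measurements.

```python
def flop_bounds(counted_fp_instructions):
    """Return (low, high) bounds on the true FLOP count.

    low:  no instruction was a multiply-add
    high: every instruction was a multiply-add (two operations each)
    """
    return counted_fp_instructions, 2 * counted_fp_instructions


def corrected_flops(counted_fp_instructions, madd_fraction):
    """Correct the count when a known fraction of the counted
    instructions are multiply-adds (each worth one extra operation)."""
    if not 0.0 <= madd_fraction <= 1.0:
        raise ValueError("madd_fraction must be between 0 and 1")
    return counted_fp_instructions * (1.0 + madd_fraction)


counted = 1_000_000          # hypothetical perfex count
low, high = flop_bounds(counted)
print(low, high)
print(corrected_flops(counted, 0.5))   # half the instructions are madds
```

Since the multiply-add fraction is rarely known in advance, the bounds are usually all that a quick perfex run can give you.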
Because of this inaccuracy, the recommended way of computing the MFLOPS
for a code is to first instrument the code to count floating point
operations using speedshop, and then divide this count by the CPU
execution time (in seconds) of a run made without the instrumentation
overhead, and by one million. See the timex command for information on
obtaining the elapsed time, user time, and system time for a program.
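The division described above looks like this in Python. The sketch assumes that the CPU execution time is the sum of the user and system times reported for the uninstrumented run; that choice, the function name mflops_from_count, and the sample numbers are assumptions made for this example.

```python
def mflops_from_count(flop_count, user_seconds, system_seconds=0.0):
    """Compute MFLOPS from an instrumented FLOP count and the CPU time
    of a separate, uninstrumented run.

    CPU time is taken as user + system time (an assumption; use user
    time alone if that suits your measurement better).
    """
    cpu_seconds = user_seconds + system_seconds
    if cpu_seconds <= 0:
        raise ValueError("CPU time must be positive")
    return flop_count / cpu_seconds / 1.0e6


# Hypothetical run: 2.5e10 operations, 98 s user + 2 s system.
print(mflops_from_count(2.5e10, 98.0, 2.0))
```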
If you have a parallel application, ideally you should run the code on a
dedicated partition of processors, and measure the real (wallclock) time
to obtain scalability data for varying numbers of processors. Presuming
the floating point operation count remains constant for a given problem
size over the range of processors, you can divide that same count by the
wallclock time of each run to obtain a total MFLOP rate.
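A small Python sketch of this bookkeeping follows. It assumes a fixed operation count for all runs, as stated above, and also reports speedup relative to the smallest processor count as one simple form of scalability data. The function name scaling_table and the timings are hypothetical.

```python
def scaling_table(flop_count, wallclock_by_procs):
    """Return a list of (processors, total MFLOPS, speedup) rows.

    wallclock_by_procs maps a processor count to the wallclock time in
    seconds for that run. Speedup is relative to the smallest
    processor count present.
    """
    procs_sorted = sorted(wallclock_by_procs)
    base_time = wallclock_by_procs[procs_sorted[0]]
    rows = []
    for procs in procs_sorted:
        seconds = wallclock_by_procs[procs]
        if seconds <= 0:
            raise ValueError("wallclock times must be positive")
        total_mflops = flop_count / seconds / 1.0e6
        rows.append((procs, total_mflops, base_time / seconds))
    return rows


# Hypothetical timings for a problem with 2.0e10 operations.
timings = {1: 200.0, 2: 104.0, 4: 55.0}
for procs, rate, speedup in scaling_table(2.0e10, timings):
    print(f"{procs:3d} procs  {rate:8.1f} MFLOPS  speedup {speedup:5.2f}")
```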
If you have some impressive single processor or parallel scaling results,
please send them to us.