Fast Hints for Performance Estimates on SGI systems

The most common metric for reporting the relative performance of application codes across machines is MFLOPS (millions of floating-point operations per second). Assuming a code executes a similar mix of arithmetic operations (ignoring any variances in precision, rounding, etc.) to solve identical problems on different machines, MFLOPS characterizes the performance of an application independent of its underlying algorithm. This allows different applications to be compared on the same computer, or different computers to be compared running the same application. Unfortunately, very few machines have direct hardware interfaces for accurately measuring the MFLOP rate of an executed program. Cray machines with HPM (Hardware Performance Monitor) provide this capability. SGI systems based on the R10000 have hardware counters, but they require an indirect process to accurately measure MFLOPS for an executable.

If your code already runs on a Cray machine with HPM, you can estimate the MFLOP rate on any other machine (e.g., an SGI) by scaling the Cray HPM MFLOP rate by the ratio of the CPU times (the Cray time divided by the other machine's time). This assumes both runs perform about the same number of floating-point operations. For example, if your code runs to completion in 100 seconds (single processor) on a Cray C90, and HPM tells you it performed at 250 MFLOPS, and it then runs in 200 seconds (single processor) on another computer, you can estimate that it runs at 125 MFLOPS on that machine.
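The scaling rule above can be written as a short calculation. This is only a sketch: the function name `scale_mflops` and its parameter names are made up for illustration, and it assumes both runs execute the same number of floating-point operations.

```python
def scale_mflops(ref_mflops, ref_cpu_seconds, target_cpu_seconds):
    """Estimate the MFLOP rate on a target machine from a reference run.

    Assumes both runs perform the same number of floating-point
    operations, so the rate scales with the ratio of the CPU times.
    """
    if ref_cpu_seconds <= 0 or target_cpu_seconds <= 0:
        raise ValueError("CPU times must be positive")
    return ref_mflops * ref_cpu_seconds / target_cpu_seconds


# The example from the text: 250 MFLOPS in 100 s on the Cray C90,
# 200 s on the other machine.
print(scale_mflops(250.0, 100.0, 200.0))  # 125.0
```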

You can get a quick estimate of the MFLOP performance of your code on the SGI using the perfex command. In the worst case, however, the number reported by
"perfex -e 21" (to count MFLOPS)
can be low by a factor of two, because perfex counts the MIPS multiply-add operation as one floating-point operation even though it really accomplishes two. Of course, the size of the error depends on how many of the floating-point operations in your code are actually multiply-adds.
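To see how far off the quick estimate can be, the following sketch corrects a reported rate for a given share of multiply-adds. The name `corrected_mflops` and the `madd_fraction` parameter are hypothetical; perfex does not report this fraction, so you would have to estimate it some other way.

```python
def corrected_mflops(reported_mflops, madd_fraction):
    """Correct a rate that counts each multiply-add as one operation.

    madd_fraction is the (assumed, separately estimated) share of the
    counted floating-point instructions that are multiply-adds. Each of
    those really performs two operations, so it adds one extra.
    """
    if not 0.0 <= madd_fraction <= 1.0:
        raise ValueError("madd_fraction must be between 0 and 1")
    return reported_mflops * (1.0 + madd_fraction)


reported = 80.0
print(corrected_mflops(reported, 0.0))   # 80.0  (no multiply-adds)
print(corrected_mflops(reported, 0.25))  # 100.0
print(corrected_mflops(reported, 1.0))   # 160.0 (worst case, factor of two)
```

The true rate therefore lies somewhere between the reported value and twice the reported value.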

Because of this inaccuracy, the recommended way of computing the MFLOPS for a code involves two steps: first instrument the code with SpeedShop to count the floating-point operations, then divide this count by the CPU execution time of a run made without the instrumentation overhead. See the timex command for information on obtaining the elapsed, user, and system times for a program.
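The division itself is simple once you have both numbers. The sketch below uses a hypothetical helper, `mflops_from_counts`, and assumes that the CPU time is the sum of the user and system times from the uninstrumented run; use the user time alone if that is what you want to report.

```python
def mflops_from_counts(flop_count, user_seconds, sys_seconds=0.0):
    """Compute MFLOPS from an operation count and an uninstrumented run.

    flop_count comes from the instrumented run; the times come from a
    separate run without instrumentation. Treating CPU time as
    user + system is an assumption made for this sketch.
    """
    cpu_seconds = user_seconds + sys_seconds
    if cpu_seconds <= 0:
        raise ValueError("CPU time must be positive")
    return flop_count / cpu_seconds / 1.0e6


# Illustrative numbers only: 25 billion operations in 98.5 s user
# plus 1.5 s system time.
print(mflops_from_counts(25_000_000_000, 98.5, 1.5))  # 250.0
```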

If you have a parallel application, ideally you should run the code on a dedicated partition of processors and measure the real (wallclock) time to obtain scalability data for varying numbers of processors. Presuming the total floating-point operation count remains constant for a given problem size over the range of processors, you can divide that same count by the wallclock time of each run to obtain a total MFLOP rate.
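A scalability table along these lines could be built as follows. This is a sketch only: `scaling_table` is a hypothetical helper, the timings are invented for illustration, and the operation count is assumed to be the same for every processor count.

```python
def scaling_table(flop_count, wallclock_by_procs):
    """Return (procs, total MFLOPS, speedup, efficiency) for each run.

    wallclock_by_procs maps a processor count to the measured wallclock
    seconds. Speedup and efficiency are taken relative to the run with
    the fewest processors.
    """
    base_procs = min(wallclock_by_procs)
    base_time = wallclock_by_procs[base_procs]
    rows = []
    for procs in sorted(wallclock_by_procs):
        seconds = wallclock_by_procs[procs]
        total_mflops = flop_count / seconds / 1.0e6
        speedup = base_time / seconds
        efficiency = speedup * base_procs / procs
        rows.append((procs, total_mflops, speedup, efficiency))
    return rows


# Made-up timings for a fixed problem of 20 billion operations.
timings = {1: 200.0, 2: 100.0, 4: 62.5, 8: 40.0}
for procs, rate, speedup, eff in scaling_table(2.0e10, timings):
    print(f"{procs:2d} procs: {rate:6.1f} MFLOPS, "
          f"speedup {speedup:.2f}, efficiency {eff:.2f}")
```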

If you have some impressive single processor or parallel scaling results, please send them to us.

