
PAPI at NCSA


The PAPI (Performance Application Programming Interface) library from the Innovative Computing Laboratory at the University of Tennessee-Knoxville is available on the Linux clusters, SGI Altix, and IBM p690 systems at NCSA. PAPI is an effort to establish a uniform, standard programming interface for accessing hardware performance counters on modern microprocessors. The PAPI web site is located at:

http://icl.cs.utk.edu/papi/

Hardware performance counters can be very useful for tuning the performance of applications and for evaluating the effectiveness of the compiler on your application. These counters allow you to directly measure the actual usage of the hardware as your application runs and may help you to diagnose bottlenecks in your application's performance. By using PAPI, you gain the benefit of a cross-platform interface to the counters, allowing you to maintain a common source for a wide variety of architectures.

This page is an overview of information that is specific to, or of particular interest to, users of PAPI at NCSA. You can find a number of documents that cover PAPI in more detail at the PAPI web site, including The PAPI User's Guide, which is tailored for the end user or person new to PAPI. The repository also includes a link to an in-depth tutorial by members of the PAPI development team entitled Performance Tuning Using Hardware Counter Data.

PAPI provides both a simple, high-level interface that may be suitable for your needs and a low-level interface that gives you much more control over PAPI, including access to native hardware events that are not part of the PAPI standard event definitions. Neither the low-level interface nor access to native events through PAPI is covered here; please refer to the PAPI web site and processor-specific documentation for details.

Note: the PAPI low-level API and mechanism for accessing native events have changed in PAPI 3. You will have to modify your source code if you are using these features of PAPI and want to use PAPI 3. Detailed instructions on converting applications and tools to the PAPI 3 API are available at the main PAPI web site. If you are using the PAPI high-level API only, your source code should require no changes to use PAPI 3.


System    Kernel PMU Support  PAPI version  Directory
Platinum  Perfctr 2.4.9pl1    2.3.4         /usr/apps/tools/papi
Platinum  Perfctr 2.4.9pl1    3.0.0 beta    /usr/apps/tools/papi3
Titan     Perfmon 0.06a       2.0.1         /usr/apps/tools/papi
Copper    -                   2.3.4.2       /usr/apps/tools/papi
Copper    -                   3.0.0 beta    /usr/apps/tools/papi3
Tungsten  Perfctr 2.6.13      3.5.0         /usr/apps/tools/papi3
Mercury   Perfmon 2.0         2.3.4         /usr/projects/perftools/papi
Mercury   Perfmon 2.0         3.0.7         /usr/projects/perftools/papi3
Cobalt    Perfmon 2.0         3.0.6         /usr/apps/tools/papi3


The PAPI directory contains the compiled libraries, include files, UNIX manual pages, and example programs from the PAPI distribution. Make sure that this directory is part of the search path for both include files and libraries during the compile and link process (see below). Add the directory /usr/apps/tools/papi/man to your MANPATH environment variable if you want the "man" command to find the PAPI manual pages.



PAPI include files for Fortran

There are three different Fortran include files that you can choose from when compiling your PAPI-enabled Fortran program:
fpapi.h
This is an include file that requires C-style preprocessing. Several compilers will treat a Fortran source code file with the suffix ".F" (uppercase F) as a file that should be passed through the C preprocessor. Consult the documentation for the compiler you are using for specifics.
f77papi.h
This is a Fortran 77-style include file. This file requires no C preprocessing, so you may find it more convenient to use.
f90papi.h
This is a Fortran 90-style include file. Like f77papi.h, this file requires no C preprocessing, so you may find it more convenient to use.

PAPI libraries

If you link with the shared (.so) version of the library, you will have to specify where the PAPI shared library can be found at runtime. For example, on the Linux clusters you can:
  • Include the directory /usr/apps/tools/papi/lib in your LD_LIBRARY_PATH environment variable.
  • Specify the option:
    -Wl,-rpath,/usr/apps/tools/papi/lib
    
    when you link your executable. The -rpath option allows you to add directories to the runtime linker's search path.

If you link the static version of the PAPI library into your program, your executable should run without having to modify the LD_LIBRARY_PATH environment variable. You can cause the static version to be used by specifying -static at link time, or by including /usr/apps/tools/papi/lib/libpapi.a on your link command.

For all compilers, specify

	-I/usr/apps/tools/papi/include
at compile time, and specify
	-L/usr/apps/tools/papi/lib -lpapi
at the link step.

If you are using PAPI on the POWER4/AIX system (Copper), you will also want to append the PMAPI library (-lpmapi) at the link step, as follows:

        -L/usr/apps/tools/papi/lib -lpapi -lpmapi



Using PAPI_flops

Perhaps the easiest way to use the PAPI high-level functions (which may be sufficient for many users) is to call the routine PAPI_flops (or in Fortran, PAPIF_flops). This routine, which may be called multiple times from a single-threaded program, is an easy way to measure wall-clock time, CPU time, the number of floating point instructions executed, and the MFLOP rate.

Here's an example of using PAPI_flops from Fortran:

      include 'f77papi.h'
      real real_time, cpu_time, mflops
      integer*8 fp_ins
      integer ierr

C Call PAPIF_flops to get things started.  This will initialize PAPI
C and start the counters running.  Each of these calls returns an
C error code in the 'ierr' parameter.  See below for details on
C how to manage this.

      call PAPIF_flops(real_time, cpu_time, fp_ins, mflops, ierr)

C Do some computation

      call compute() 

C Read the values in the counters and print them out.  Any call to 
C PAPIF_flops with fp_ins set to the value -1 will reinitialize
C all counters to zero.  You might want to do this in order 
C to individually time different portions of your application.

      call PAPIF_flops(real_time, cpu_time, fp_ins, mflops, ierr)

      write (*,100) real_time, cpu_time, fp_ins, mflops

100   format('           Real time (secs) :', f15.3, 
     +      /'            CPU time (secs) :', f15.3,
     +      /'Floating point instructions :', i15,
     +      /'                     MFLOPS :', f15.3)
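
To time one portion of an application without reinitializing the counters, you can subtract two successive sets of readings. The following is a minimal sketch in plain C with no PAPI calls. The names flops_reading and region_mflops are hypothetical, and the sketch assumes that the real time and instruction count you pass in are running totals since the first call to PAPI_flops.

```c
/* Hypothetical container for one set of values obtained from PAPI_flops.
   Assumption: both fields are running totals since the first call. */
struct flops_reading {
    float real_time;    /* wall-clock time in seconds */
    long long fp_ins;   /* floating point instructions executed */
};

/* Rate, in millions of floating point instructions per second, for the
   region of code that ran between the two readings.
   Returns 0.0 if no time elapsed between them. */
double region_mflops(struct flops_reading before, struct flops_reading after)
{
    double seconds = (double)after.real_time - (double)before.real_time;
    long long instructions = after.fp_ins - before.fp_ins;

    if (seconds <= 0.0)
        return 0.0;
    return (double)instructions / seconds / 1.0e6;
}
```

For example, readings of (1.0 s, 1,000,000 instructions) and (3.0 s, 5,000,000 instructions) describe a region that executed 4,000,000 instructions in 2 seconds.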

Using the general PAPI high-level interface

Here's an example in Fortran of using the general high-level PAPI API, which allows you to count any available PAPI events of your choice:
  1. Include the proper PAPI constant definitions:
    	include 'f77papi.h'
    
  2. Declare the events you want to count and other error-related variables, for example:
           integer events (2), numevents, ierr
           character*(PAPI_MAX_STR_LEN) errorstring
    
  3. Declare variables to hold the event counts:
           integer*8 values (2)
    
  4. Set each event to the desired type, listed in f77papi.h (or below):
           numevents = 2
           events(1) = PAPI_FP_INS
           events(2) = PAPI_TOT_CYC
    
  5. Start and clear the counters:
           call PAPIF_start_counters(events, numevents, ierr)
    
  6. Do some computation, then read and reset them but leave them running:
           call PAPIF_read_counters(values, numevents, ierr)
    
    A similar routine, PAPIF_accum_counters, accepts the same arguments but adds the current values to the running totals already contained in the values array.
  7. Compute some more and then stop the counters and retrieve the values:
           call PAPIF_stop_counters(values, numevents, ierr)
    
  8. Each of those calls returns an error code that you can handle this way:
           if ( ierr .ne. PAPI_OK ) then
    	 call PAPIF_perror(ierr, errorstring, PAPI_MAX_STR_LEN)
    	 print *, errorstring
           endif
    
A similar C sequence is:
	#include <stdio.h>
	#include <papi.h>

	#define NUMEVENTS 2

	/* Event codes are passed to PAPI as an array of int */
	int events[NUMEVENTS] = {PAPI_FP_INS, PAPI_TOT_CYC};
	int errorcode;
	long long values[NUMEVENTS];
	char errorstring[PAPI_MAX_STR_LEN+1];

	/* Start and clear the counters */
	errorcode = PAPI_start_counters(events, NUMEVENTS);

	/* Compute... */

	/* Read and reset the counters, but leave them running */
	errorcode = PAPI_read_counters(values, NUMEVENTS);

	/* Compute some more... */

	/* Stop the counters and retrieve the values */
	errorcode = PAPI_stop_counters(values, NUMEVENTS);

	/* Every call returns an error code; check each one this way */
	if (errorcode != PAPI_OK) {
	    PAPI_perror(errorcode, errorstring, PAPI_MAX_STR_LEN);
	    fprintf(stderr, "PAPI error (%d): %s\n", errorcode, errorstring);
	}
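
Once the values array has been filled in, the raw counts are usually combined into a ratio. The following is a minimal sketch in plain C, assuming the event order used in the sequence above (values[0] holds the PAPI_FP_INS count and values[1] holds the PAPI_TOT_CYC count). The function name fp_ins_per_cycle is hypothetical and is not part of PAPI.

```c
/* Floating point instructions per cycle.
   Assumes values[0] = PAPI_FP_INS count and values[1] = PAPI_TOT_CYC count,
   matching the order of the events array in the sequence above.
   Returns 0.0 if no cycles were counted. */
double fp_ins_per_cycle(const long long values[2])
{
    if (values[1] <= 0)
        return 0.0;
    return (double)values[0] / (double)values[1];
}
```

Because the read call resets the counters, the values returned by the stop call cover only the code that ran after the most recent read.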



You can count up to two (Pentium III), four (Itanium and Itanium 2), eighteen (Xeon), or eight (POWER4) individual events at a time. Alternatively, you can "multiplex" the available physical counters over a larger number of events. Please refer to the PAPI web site for instructions on multiplexing.
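
The counter limits above can be turned into a quick check of whether a set of events fits without multiplexing. This is only an illustrative sketch in plain C: the table, the struct counter_limit, and the function needs_multiplexing are hypothetical names, and the processor strings are simply the labels used on this page.

```c
#include <string.h>

/* Number of physical counters per processor, as listed on this page. */
struct counter_limit {
    const char *processor;
    int counters;
};

static const struct counter_limit limits[] = {
    {"Pentium III", 2},
    {"Itanium", 4},
    {"Itanium 2", 4},
    {"Xeon", 18},
    {"POWER4", 8},
};

/* Returns 1 if num_events exceeds the number of physical counters
   (so multiplexing would be needed), 0 if the events fit, and -1 if
   the processor name is not in the table. */
int needs_multiplexing(const char *processor, int num_events)
{
    size_t i;

    for (i = 0; i < sizeof limits / sizeof limits[0]; i++) {
        if (strcmp(limits[i].processor, processor) == 0)
            return num_events > limits[i].counters ? 1 : 0;
    }
    return -1;
}
```

A result of 0 is a necessary condition only. Some events can be counted only on particular counters, so a set that fits by number may still not be countable together.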

Certain native hardware events are restricted to a subset of the available counters. The details of this are beyond the scope of this web page; refer to the Intel and IBM manuals for more information. In general though, you don't have to concern yourself with this when accessing counters through the PAPI software; the details are taken care of for you.

Below is a table of the standard hardware performance counter events that PAPI reports as available on Pentium III, Xeon, Itanium, Itanium 2, and POWER4. This is a subset of the 104 standard events defined by PAPI: 45 are supported on Pentium III, 19 on Xeon, 43 on Itanium, 56 on Itanium 2, and 22 on POWER4. They are listed here for convenience in determining which PAPI events you can measure on the Intel-based Linux clusters and IBM p690 systems at NCSA. You can find the full listing of PAPI standard events at the PAPI web site or in the include file papiStdEventDefs.h.

Note: not all of these events are available on all platforms. The table indicates which events are available on each processor, both in tabular form and by color-coding. Additionally, these listings refer to PAPI 2 with the exception of Tungsten (Pentium 4), which only supports PAPI 3 beta.

Legend:
Red: available only on Pentium III
Turquoise: available only on Xeon
Blue: available only on Itanium
Plum: available only on Itanium 2
Yellow: available only on POWER4
Green: available on all processors
White: available on some (but not all) processors

"*": available and measured by a single native event
"D": available and is a derived event (calculated from multiple native events)

Standard PAPI Events Available on NCSA Systems

Systems (processors): Platinum (Pentium III), Tungsten (Xeon; PAPI 3 only), Titan (Itanium; PAPI 2 only), Mercury (Itanium 2), Copper (POWER4). A "-" marks an event that is not available on that system.

Name          Platinum  Tungsten  Titan     Mercury   Copper    Description

Conditional Branching
PAPI_BR_CN    -         -         -         -         -         Conditional branch instructions
PAPI_BR_INS   *         *         *         *         -         Branch instructions
PAPI_BR_MSP   *         *         D         D         -         Conditional branch instructions mispredicted
PAPI_BR_NTK   D         *         D         -         -         Conditional branch instructions not taken
PAPI_BR_PRC   D         *         *         *         -         Conditional branch instructions correctly predicted
PAPI_BR_TKN   *         *         D         -         -         Conditional branch instructions taken
PAPI_BTAC_M   *         -         -         -         -         Branch target address cache misses

Cache Requests
PAPI_CA_CLN   *         -         -         -         -         Requests for exclusive access to clean cache line
PAPI_CA_INV   *         -         -         D         -         Requests for cache line invalidation
PAPI_CA_ITV   *         -         -         -         -         Requests for cache line intervention
PAPI_CA_SHR   *         -         -         -         -         Requests for exclusive access to shared cache line
PAPI_CA_SNP   -         -         -         *         -         Requests for a snoop

Conditional Store
(no events available)

Floating Point Operations
PAPI_FLOPS    D         -         D         D         D         Floating point instructions per second (PAPI 2 only)
PAPI_FMA_INS  -         -         -         -         *         Floating point multiply-add instructions completed
PAPI_FML_INS  *         -         -         -         -         Floating point multiply instructions
PAPI_FDV_INS  *         -         -         -         *         Floating point divide instructions
PAPI_FSQ_INS  -         -         -         -         *         Floating point square root instructions
PAPI_FP_INS   *         D         D         *         *         Floating point instructions
PAPI_FP_OPS   *         D         -         *         D         Floating point operations (PAPI 3 only)

Instruction Counting
PAPI_FXU_IDL  -         -         -         -         *         Cycles integer units are idle
PAPI_HW_INT   *         -         -         -         *         Hardware interrupts
PAPI_INT_INS  -         -         -         -         *         Integer instructions
PAPI_IPS      D         -         -         -         D         Instructions per second (PAPI 2 only)
PAPI_TOT_CYC  *         *         *         *         *         Total cycles
PAPI_TOT_IIS  *         *         -         *         *         Instructions issued
PAPI_TOT_INS  *         *         *         D         *         Instructions completed
PAPI_VEC_INS  *         D         -         -         -         Vector/SIMD instructions

Cache Access
PAPI_L1_DCA   *         -         *         *         D         L1 data cache accesses
PAPI_L1_DCH   D         -         D         D         -         L1 data cache hits
PAPI_L1_DCR   -         -         -         *         *         L1 data cache reads
PAPI_L1_DCM   *         -         *         *         D         L1 data cache misses
PAPI_L1_DCW   -         -         -         -         *         L1 data cache writes
PAPI_L1_ICA   *         *         -         D         -         L1 instruction cache accesses
PAPI_L1_ICH   D         -         -         -         -         L1 instruction cache hits
PAPI_L1_ICM   *         *         *         *         -         L1 instruction cache misses
PAPI_L1_ICR   *         -         D         D         -         L1 instruction cache reads
PAPI_L1_ICW   *         -         -         -         -         L1 instruction cache writes
PAPI_L1_LDM   *         -         D         D         *         L1 load misses
PAPI_L1_STM   *         -         -         -         *         L1 store misses
PAPI_L1_TCA   D         -         -         -         -         L1 total cache accesses
PAPI_L1_TCM   *         -         D         D         -         L1 total cache misses

PAPI_L2_DCA   D         -         *         *         -         L2 data cache accesses
PAPI_L2_DCH   D         -         -         D         -         L2 data cache hits
PAPI_L2_DCM   -         -         D         *         -         L2 data cache misses
PAPI_L2_DCR   *         -         *         *         -         L2 data cache reads
PAPI_L2_DCW   *         -         *         *         -         L2 data cache writes
PAPI_L2_ICA   *         -         -         -         -         L2 instruction cache accesses
PAPI_L2_ICM   -         -         *         *         -         L2 instruction cache misses
PAPI_L2_ICR   *         -         D         D         -         L2 instruction cache reads
PAPI_L2_LDM   -         -         D         *         -         L2 load misses
PAPI_L2_STM   -         -         *         *         -         L2 store misses
PAPI_L2_TCA   *         *         -         *         -         L2 total cache accesses
PAPI_L2_TCH   -         *         -         *         -         L2 total cache hits
PAPI_L2_TCM   *         *         *         *         -         L2 total cache misses
PAPI_L2_TCR   D         -         -         D         -         L2 total cache reads
PAPI_L2_TCW   *         -         -         -         -         L2 total cache writes

PAPI_L3_DCA   -         -         D         *         -         L3 data cache accesses
PAPI_L3_DCH   -         -         D         D         -         L3 data cache hits
PAPI_L3_DCM   -         -         D         D         -         L3 data cache misses
PAPI_L3_DCR   -         -         *         *         -         L3 data cache reads
PAPI_L3_DCW   -         -         *         *         -         L3 data cache writes
PAPI_L3_ICH   -         -         *         *         -         L3 instruction cache hits
PAPI_L3_ICM   -         -         *         *         -         L3 instruction cache misses
PAPI_L3_ICR   -         -         *         *         -         L3 instruction cache reads
PAPI_L3_LDM   -         -         D         *         -         L3 load misses
PAPI_L3_STM   -         -         *         *         -         L3 store misses
PAPI_L3_TCA   -         *         -         *         -         L3 total cache accesses
PAPI_L3_TCH   -         *         -         D         -         L3 total cache hits
PAPI_L3_TCM   -         *         *         *         -         L3 total cache misses
PAPI_L3_TCR   -         -         -         *         -         L3 total cache reads
PAPI_L3_TCW   -         -         -         *         -         L3 total cache writes

Data Access
PAPI_LD_INS   -         D         *         *         -         Load instructions
PAPI_LST_INS  -         D         D         -         -         Load/store instructions completed
PAPI_FP_STAL  -         -         -         *         -         Cycles the floating point units are stalled
PAPI_MEM_SCY  -         -         *         -         -         Cycles stalled waiting for memory access
PAPI_RES_STL  *         *         -         *         -         Cycles stalled on any resource
PAPI_SR_INS   -         D         *         *         -         Store instructions
PAPI_STL_CCY  -         -         -         *         -         Cycles with no instructions completed
PAPI_STL_ICY  -         -         *         *         *         Cycles with no instruction issue

TLB Operations
PAPI_TLB_DM   -         *         *         *         *         Data translation lookaside buffer misses
PAPI_TLB_IM   *         *         *         *         *         Instruction translation lookaside buffer misses
PAPI_TLB_TL   -         *         -         D         D         Total translation lookaside buffer misses
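
To make the idea of a derived ("D") event concrete, the sketch below shows one way a value such as a cache hit count could be calculated from two other counts. This is purely illustrative plain C: the function derived_cache_hits is hypothetical, and the formula (hits = accesses - misses) is an assumption made for the example, not PAPI's actual per-platform definition.

```c
/* Illustrative derived value: cache hits computed from two measured
   counts. Assumption for this sketch only: hits = accesses - misses.
   Returns 0 if the inputs are inconsistent (more misses than accesses). */
long long derived_cache_hits(long long accesses, long long misses)
{
    if (misses > accesses)
        return 0;
    return accesses - misses;
}
```

With PAPI itself you simply request the derived event by name, and the library performs whatever combination of native events applies on that processor.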



Much more detailed information about the hardware performance counters on Pentium III, Pentium 4, Xeon, Itanium, and Itanium 2, including a complete listing of all native events available on these processors, can be found at Intel's web site:

Intel® Architecture Optimization Reference Manual (Pentium III)

IA-32 Intel® Architecture Optimization Reference Manual (see also IA-32 Intel® Architecture Software Developer's Manual, Volume 3: System Programming Guide) (Pentium 4, Xeon, Pentium M)

Intel® Itanium® Processor Reference Manual for Software Development

Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization



There are currently no reference manuals that list available native events for the POWER4 architecture from IBM, but you can review the files within the directory /usr/pmapi/lib to see what events are available on these processors. In particular, the files:

POWER4.{evs,gps}

provide information about individual performance events as well as "event group" information.

You may also find the IBM document PowerPC Architecture Book helpful.



How many hardware performance counters are there on Pentium III, Xeon, Itanium, Itanium 2, and POWER4 processors?

There are two counters on Pentium III, eighteen on Xeon, four on Itanium, four on Itanium 2, and eight on POWER4.

Why is PAPIF_flops returning bad numbers for times and MFLOPS? I know they're not correct.

Make sure that you aren't passing in double-precision variables. This might happen if you specify the -r8 flag to the Fortran compiler, for example. PAPIF_flops expects a 32-bit floating point number for the times and MFLOP arguments. Try declaring the variables you pass to PAPIF_flops as real*4.
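
The effect of this mismatch can be reproduced in plain C without PAPI. The sketch below simulates a routine that stores a 32-bit float into a variable that the caller declared as a 64-bit double; the function name read_back_as_double is hypothetical and exists only for this illustration.

```c
#include <string.h>

/* Simulates a callee that writes a 32-bit float into the storage of a
   variable the caller declared as a 64-bit double, and returns what the
   caller then sees when it reads that variable as a double. */
double read_back_as_double(float stored)
{
    double caller_variable = 0.0;

    /* Only 4 of the 8 bytes are overwritten, and they are laid out as a
       float, so the double the caller reads is not the value stored. */
    memcpy(&caller_variable, &stored, sizeof stored);
    return caller_variable;
}
```

Storing 1.5 this way does not read back as 1.5, which is the kind of "bad number" described above. The exact garbage value depends on the byte order of the machine.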

Are the floating point operations reported by PAPI accurate?

On Pentium, PAPI bases its count of floating point operations on the native event FLOPS. On Itanium, PAPI calculates the number of floating point operations using the following formula:
    FP_OPS_RETIRED_HI*4 + FP_OPS_RETIRED_LO
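
Written as a small C function, the Itanium formula above looks like this. The function name itanium_fp_ops is hypothetical; the two parameters stand for the counts of the native events named in the formula.

```c
/* Floating point operation count on Itanium, following the formula
   given above: FP_OPS_RETIRED_HI*4 + FP_OPS_RETIRED_LO. */
long long itanium_fp_ops(long long fp_ops_retired_hi,
                         long long fp_ops_retired_lo)
{
    return fp_ops_retired_hi * 4 + fp_ops_retired_lo;
}
```

For example, counts of 10 for FP_OPS_RETIRED_HI and 3 for FP_OPS_RETIRED_LO give 43 floating point operations.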

On Itanium 2, PAPI uses the native event FP_OPS_RETIRED to count floating point operations.

These should give you an accurate count of total floating point operations retired by your code on NCSA Linux clusters (in contrast, the MIPS R10000 counters on the SGI Origin, for example, count a fused multiply-add instruction as a single floating-point operation).

Note: On Pentium III, SSE/SSE2 vector operations are not included in the floating point operation count. On Pentium 4, PAPI 3 includes SSE2 instructions in the floating point operation count, but you may want to adjust environment variables to count the exact type of vector floating point operations of interest. Please refer to the PAPI documentation for more information.

On POWER4, the native event PM_FPU_FIN is used. Be aware that this event alone does not accurately measure the floating point operation count of an application. Until this is resolved, you should use other tools such as the HPM Toolkit, or program with PMAPI directly.

Where do I find Perfometer and Dynaprof for PAPI on NCSA systems?

Neither Perfometer nor Dynaprof is currently installed at NCSA. Please contact NCSA support staff if you have a need for these tools.

Are there any utilities that allow me to access the performance counters without modifying or relinking my code?

Yes. Here is a synopsis of these utilities:

Recommended (for ease-of-use)

A command-line utility, "psrun", is available on the Platinum, Titan, Tungsten, and Mercury clusters. psrun uses PAPI as the underlying support for accessing the performance counters. psrun was developed by the PerfSuite project at NCSA. It supports the option "-h" to access brief online help on usage.

On Copper, the utility "hpmcount", written by Luiz DeRose of IBM ACTC, can be used to measure hardware performance counter data with an unmodified application. hpmcount supports the option "-h" to access brief online help on usage. hpmcount uses the AIX PMAPI kernel interface to access the counters on the p690, not PAPI. hpmcount is part of IBM's HPM Toolkit.

Also Available (but more complex)

There are two other utilities that require you to work with the performance counter events native to each system (refer to the Intel documents listed above for details).

On Platinum and Tungsten, a utility called "perfex" allows you to measure native events for an arbitrary program from the command line. perfex was written by Mikael Pettersson of Uppsala University, Sweden (author of the IA-32 performance counter driver). Although very flexible, perfex can be rather difficult to use, so we recommend that you first try psrun on the IA-32 clusters.

On Titan, a similar utility called "pfmon" is available. pfmon was written by Stephane Eranian of Hewlett-Packard (author of the IA-64 performance counter driver).

Multiplexing support

Unlike SGI's "perfex", most of the above utilities do not support multiplexing of the performance counters ("psrun" is the exception).

For more detailed information about these tools and their use at NCSA:

You can also check the official PAPI FAQ if we haven't answered your question here.