Consider the following points when you want to get the maximum performance out of your program.
A program that has only one MPI communication thread may set the environment variable MP_SINGLE_THREAD=yes before calling MPI_Init. This avoids some of the locking that is otherwise required to maintain consistent internal MPI state. The program may have other threads that do computation or other work, as long as they do not make MPI calls. Note that the implementation of MPI I/O and MPI one-sided communication is thread-based, so these facilities must not be used when MP_SINGLE_THREAD is set to yes.
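The ordering requirement (set MP_SINGLE_THREAD first, call MPI_Init second) can be sketched in C as follows. This is a minimal illustration, not a definitive implementation: the helper name init_single_thread_mpi and the USE_MPI build flag are hypothetical, and the sketch assumes that a value set with setenv inside the task is seen by the MPI library during initialization. Setting the variable in the shell before starting the job is the alternative.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* makes setenv() visible in strict modes */
#endif
#include <stdio.h>
#include <stdlib.h>

/* USE_MPI is a hypothetical build flag: define it when compiling with an
 * MPI compiler wrapper so that the real MPI_Init call is included. */
#ifdef USE_MPI
#include <mpi.h>
#endif

/* Hypothetical helper: request single-thread mode, then initialize MPI.
 * Returns 0 on success and nonzero on failure. */
int init_single_thread_mpi(int *argc, char ***argv)
{
    /* The variable must be set before MPI_Init is called. */
    if (setenv("MP_SINGLE_THREAD", "yes", 1) != 0) {
        perror("setenv");
        return 1;
    }
#ifdef USE_MPI
    if (MPI_Init(argc, argv) != MPI_SUCCESS)
        return 1;
#else
    (void)argc;   /* unused when built without MPI */
    (void)argv;
#endif
    return 0;
}
```

Call the helper once from the main thread, as the first MPI-related action of the program, and make all subsequent MPI calls from that same thread. Because MPI I/O and one-sided communication are thread-based, a program initialized this way should not call them.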
The MP_EUIDEVELOP environment variable lets you control how much checking is done when you run your program. Eliminating checking altogether (setting MP_EUIDEVELOP to min) reduces latency, but critical information may be unavailable if your executable hangs because of message passing errors. For more information on MP_EUIDEVELOP and other POE environment variables, see IBM Parallel Environment for AIX: Operation and Use, Vol. 1.
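Because running with minimal checking can make a hang harder to diagnose, it may help to record the setting in the job output. The following C sketch only reads the environment. The helper names checking_is_minimal and report_checking_level are hypothetical, and the exact lowercase comparison against "min" is an assumption of this sketch, not a statement about which spellings POE accepts.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if MP_EUIDEVELOP is exactly "min", else 0.
 * The exact-match comparison is an assumption made for this sketch. */
int checking_is_minimal(void)
{
    const char *level = getenv("MP_EUIDEVELOP");
    return level != NULL && strcmp(level, "min") == 0;
}

/* Hypothetical helper: write the current setting to the given stream and
 * add a reminder when checking has been reduced to the minimum. */
void report_checking_level(FILE *out)
{
    const char *level = getenv("MP_EUIDEVELOP");

    fprintf(out, "MP_EUIDEVELOP=%s\n", level != NULL ? level : "(unset)");
    if (checking_is_minimal())
        fprintf(out, "note: minimal checking; hang diagnostics may be limited\n");
}
```

Calling report_checking_level(stderr) once at startup leaves a record in the job log, so that a later hang can be matched to the checking level that was in effect.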
The profile results (gmon.out) contain only a per-task summary that combines the information from all of the threads in the task. As a result, gprof and Xprofiler can show only this summarized per-task data; per-thread data is not available.
For more information on profiling, see IBM Parallel Environment for AIX: Operation and Use, Vol. 2.