There are two approaches to tuning the performance of a parallel application: you can tune the sequential program and then parallelize it, or you can parallelize the program and then tune the result.
With the first approach, the tuning process is the same as for any sequential program, and you use the same tools: prof, gprof, and tprof. In this case, the parallelization step must take performance into account and should avoid anything that adversely affects it.
Both of these techniques yield comparable results. The difference is in the tools that are used in each of the approaches, and how they are used.
With either approach, you use the standard sequential tools in the traditional manner. When you tune an application and then parallelize it, observe the communication performance, how it affects the performance of each of the individual tasks, and how the tasks affect each other. For example, does one task spend a lot of time waiting for messages from another? If so, perhaps you need to rebalance the workload. Or, if a task starts waiting for a message long before it arrives, perhaps it could do more algorithmic processing before waiting for the message. When an application is parallelized and then tuned, you need a way to collect performance data that includes both communication and algorithmic information. That way, if the performance of a task needs to be improved, you can decide between tuning the algorithm and tuning the communication.
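As a rough illustration of the kind of data this implies, the sketch below assumes you have already recorded, for each task, how long it spent on algorithmic work and how long it spent blocked waiting for messages. The `TaskTiming` record, the function names, the sample numbers, and the 25% threshold are all made up for this example; none of them are part of PE.

```python
from dataclasses import dataclass


@dataclass
class TaskTiming:
    """Hypothetical per-task timing record, in seconds."""
    task_id: int
    compute: float  # time spent in algorithmic work
    wait: float     # time spent blocked waiting for messages


def wait_fraction(t: TaskTiming) -> float:
    """Share of a task's measured time that was spent waiting for messages."""
    total = t.compute + t.wait
    return t.wait / total if total > 0 else 0.0


def tasks_to_investigate(timings, threshold=0.25):
    """Return IDs of tasks whose wait fraction exceeds the threshold.

    The default threshold is an arbitrary value chosen for illustration.
    """
    return [t.task_id for t in timings if wait_fraction(t) > threshold]


# Invented sample data for three tasks.
timings = [
    TaskTiming(task_id=0, compute=9.0, wait=1.0),
    TaskTiming(task_id=1, compute=4.0, wait=6.0),
    TaskTiming(task_id=2, compute=8.0, wait=2.0),
]

for t in timings:
    print(f"task {t.task_id}: {wait_fraction(t):.0%} of time waiting")

print("investigate:", tasks_to_investigate(timings))
```

In this invented data set, task 1 spends most of its time waiting, so it is the one to look at first: either the tasks it depends on are overloaded and the workload needs rebalancing, or task 1 could do more of its own processing before it waits.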
This section will not deal with standard algorithmic tuning techniques. Rather, we will discuss some of the ways PE can help you tune the parallel nature of your application, regardless of the approach you take. To illustrate this, we'll use two examples.