IBM Books

Hitchhiker's Guide


Why is this so slow?

So, you've got a serial program and you want it to execute

faster. In this situation, it's best not to jump into parallelizing your program right away. Instead, you start by tuning your serial algorithm.

The program we'll use in this next example approximates the two-dimensional Laplace equation and, in our case, uses a 4-point stencil.

Our algorithm is very straightforward. For each array element, we'll assign that element the average of the four elements that are adjacent to it (except the rows and columns that represent the boundary conditions of the problem).

Note: You may find it helpful to refer to In Search of Clusters by Gregory Phister for more information on this problem and how to parallelize it.

Note that the 4-point stencil program is central to this entire section, so you may want to spend some time to understand how it works.

The first step is to compile our serial program. However, before you do this, be sure you have a copy of stencil.dat in your program directory, or run the init program to generate one. Once we've done this, we can compile our serial program with the xlf command:

$ xlf -O2 naive.f -o naive

Next, we need to run the program and collect some information to see how it performs. You can use the UNIX time command to do this:

$ time naive

The following table shows the result:

Program Name Tasks Wallclock Time Array Size per Task
naive 1 (single processor) 11m1.94s 1000x1000

Note: The figures in the table above, as well as the others in this section, provide results that were gathered on an IBM RS/6000 SP. Your execution time may vary, depending on the system you're using.

Looking at these results, we can see that there's some room for improvement, especially if we scale the problem to a much larger array. So, how can we improve the performance?

Profile it

The first step in tuning our program is to find the areas within it that execute most of the work. Locating these compute-intensive areas within our program lets us focus on the areas that give us the most benefit from tuning. How do we find these areas? The best way to find them is to profile your program.

Profile your program using Xprofiler

When we profile our program, we need to compile it with the -pg flag to generate profiling data. Note that the -O2 flag is a capital letter O followed by the number 2:

$ xlf -pg -O2 naive.f -o naive

The -pg flag compiles and links the executable so that when we run the program, the performance data gets written to output.

Now that we've compiled our program with the -pg flag, let's run it again to see what we get:

$ naive

This generates a file called gmon.out in the current working directory. We can look at the contents of gmon.out with the Xprofiler profiling tool. To start Xprofiler, we'll use the xprofiler command, like this:

$ xprofiler naive gmon.out

The Xprofiler main window appears, and in this window you'll see the function call tree. The function call tree is a graphical representation of the functions within your application and their inter-relationships. Each function is represented by a green, solid-filled box called a function box. In simple terms, the larger this box, the greater percentage of the total running time it consumes. So, the largest box represents the function doing the most work. The calls between functions are represented by blue arrows drawn between them call arcs. The arrowhead of the call arc points to the function that is being called. The function boxes and call arcs that belong to each library in your application appear within a fenced-in area called a cluster box. For the purposes of this section, we'll remove the cluster boxes from the display.

For more information on Xprofiler, see IBM Parallel Environment for AIX: Operation and Use, Vol. 2.

PLACE
the mouse cursor over the Filter menu.

CLICK
the left mouse button

The Filter menu appears.

SELECT
the Hide All Library Calls option.

The library calls disappear from the function call tree.

PLACE
the mouse cursor over the Filter menu.

CLICK
the left mouse button.

The Filter menu appears.

SELECT
the Uncluster Functions option.

The functions expand to fill the screen.

Locate the largest function box in the function call tree. We can get the name of the function by looking a little more closely at it:

PLACE
the mouse cursor over the View menu.

The View menu appears.

PLACE
the mouse cursor over the Overview option.

CLICK
the left mouse button.

The Overview Window appears.

Figure 3. Overview Window

View figure.

The Overview Window includes a light blue highlight area that lets you zoom in and out of specific areas of the function call tree. To take a closer look at the largest function of naive:

PLACE
the mouse cursor over the lower left corner of the blue highlight area. You know that your cursor is over the corner when the cursor icon changes to a right angle with an arrow pointing into it.

PRESS and HOLD
the left mouse button, and drag it diagonally upward and to the right (toward the center of the sizing box) to shrink the box. When it's about half its original size, release the mouse button.

The corresponding area of the function call tree, in the main window, appears magnified.

If the largest function wasn't within the highlight area, it didn't get magnified. If this was the case, you'll need to move the highlight area:

PLACE
the cursor over the highlighted area.

PRESS and HOLD
the left mouse button.

DRAG
the highlight area, using the mouse, and place it over the largest function. Release the mouse button.

The largest function appears magnified in the function call tree.

Just below the function is its name, so we can now see that most of the work is being done in the compute_stencil() subroutine. This subroutine is where we should focus our attention.

It's important to note that the programming style you choose can influence your program's performance just as much as the algorithm you use. In some cases, this will be clear by looking at the data you collect when your program executes. In other cases, you will know this from experience. There are many books that cover the subject of code optimization, many of which are extremely complex. See Bibliography.

As far as we're concerned, the goal here is not to use every optimization trick that's ever been thought up. Instead, we should focus on some basic techniques that can produce the biggest performance boost for the time and effort we spend.

Profile your program using the Performance Collection Tool

The best way to begin is to look at our use of memory (including hardware data cache) as well as what we're doing in the critical section of our code. To do this, we are going to use the Performance Collection Tool to count the number of cache misses. We know that the fewer the number of cache misses, the better the performance of our code will be.

When we profile our program using PCT, we need to compile it with the required -g flag to generate profiling data. You can also include the optional -o flag to specify an output file:

$ xlf -g -o naive naive.f

Once we have generated the profiling data, we can use PCT to examine the data in more detail.

TYPE
pct to start up the Performance Collection Tool graphical user interface. From the main window, you are prompted to either load and start an application or connect to one that is already running.

SELECT
the Load a new application option and click on OK.

The Load Application window opens and you are prompted to select the application you want to load.

Figure 4. Load Application Window

View figure.

CLICK
the Browse button next to the Executable Name field and select the naive program and identify it as a serial application.

CLICK
the Load button to load the application.

The Probe Data Selection window opens.

Figure 5. Probe Data Selection window

View figure.

SELECT
the type of data you want to collect. In our case, we want to select the Hardware and operating system profiles option.

SPECIFY
the directory and basename for your output file and click OK. Note that the basename you specify will have a ".cdf" suffix and a task number suffix appended to it.

The main window comes to the foreground and the source tree for the naive executable is expanded.

Figure 6. Source Tree Window

View figure.

SELECT
the naive task from the Process List.

SELECT
the naive_f function to expand it.

SELECT
the compute_stencil() subroutine from the naive.f file in the source tree.

SELECT
the hardware counter probe to collect cache information. You will want to select the L1 option to display level one information. For example, the option you select may look like:
2 L1_TLB

Figure 7. Process List, Source Tree, and Probe Selection Window

View figure.

CLICK
the Add button. If you look at the compute_stencil() subroutine in the source tree, you will see that a Probe id has been added.

SELECT
Application > Start from the menu bar to run the program.

When the application program has finished executing, the Target Application Exited window appears. Click on the OK button to exit PCT.

Profile your program using the Profile Visualization Tool

Now that we have collected our data on cache misses, we want to be able to view it and we can do that using the Profile Visualization Tool (PVT). PCT generates a NetCDF file (Network Common Data File) which we can view using PVT.

TYPE
pvt to start up the Profile Visualization Tool.

SELECT
File > Load from the menu bar to select and load the CDF file. Locate the CDF file that was generated from PCT from the list of files that appears and select it.

CLICK
the Open button to load the file.

SELECT
View > Expand All to expand the tree to view the function compute_stencil()

Figure 8. Data View Area

View figure.

CLICK
the Function Call Count option in the pulldown menu located in the top right side of the Data View area. Select the Data cache miss option to view the number of cache misses for the function compute_stencil. The amount of L1 cache misses for each function are listed in the Data View window area.

So..... let's look at our code:

iter_count = 0
100  CONTINUE
local_err = 0.0
iter_count = iter_count + 1
 
DO i=1, m-2
DO j=1, n-2
old_value = stencil(i,j)
 
stencil(i,j) = ( stencil(i-1, j ) +
1                          stencil(i+1, j ) +
2                          stencil( i ,j-1) +
3                          stencil( i ,j+1) ) / 4
local_err = MAX(local_err,ABS(old_value-stencil(i,j)))
END DO
END DO
IF(MOD(iter_count,100).EQ.0)PRINT *, iter_count, local_err
IF (close_enough.LT.local_err) GOTO 100
PRINT *, "convergence reached after ", iter_count, " iterations."

By looking at the two DO loops above, we can see that our compute subroutine is traversing our array first across rows, and then down columns. This program must have been written by some alien being from the planet C because Fortran arrays are stored in column major form rather than row major form.

The first improvement we should make is to reorder our loops so that they traverse down columns rather than across rows. This should provide a reasonable performance boost. Note that it's not always possible to change the order of loops; it depends on the data referenced within the loop body. As long as the values used in every loop iteration don't change when the loops are reordered, then it's safe to change their order. In the example we just looked at, it was safe to reorder the loops, so here's what the revised program looks like. Notice that all we did was swap the order of the loops.

 DO j=1, n-2
DO i=1, m-2
old_value = stencil(i,j)
 

The second thing we should look at is the type of work that's being done in our loop. If we look carefully, we'll notice that the MAX and ABS subroutines are called in each iteration of the loop, so we should make sure these subroutines are compiled inline. Because these subroutines are intrinsic to our Fortran compiler, this is already done for us.

$ xlf -O2 reordered.f -o reordered

In the last scenario, we ran the naive program. You should now run the same scenario using the reordered program to more accurately compare the cache misses. You should see that the number of cache misses for reordered has decreased, thereby increasing the program's efficiency.

If we run the previous scenario again using the reordered subroutine, we notice that the cache misses are lower:

Figure 9. Data View Area (fewer cache misses showing)

View figure.

As before, we need to time our run, like this:

$ time reordered

And here are the results as compared to the original naive version:

Program Name Tasks Wallclock Time Array Size per Task
naive 1 (single processor) 11m1.94s 1000x1000
reordered 1 (single processor) 5m35.38s 1000x1000

As you can see by the results, with just a small amount of analysis, we doubled performance. And we haven't even considered parallelism yet. However, this still isn't the performance that we want, especially for very large arrays (the CPU time is good, but the elapsed time is not).

Parallelize it

Now that we feel confident that our serial program is reasonably efficient, we should look at ways to parallelize it. As we discussed in Chapter 2, The answer is 42, there are many ways to parallelize a program, but the two most commonly used techniques are functional decomposition and data decomposition. We'll focus on data decomposition only.

OK, so how do I decompose my data? Let's start by dividing the work across the processors. Each task will compute a section of an array, and each program will solve 1/n of the problem when using n processors.

Here's the algorithm we'll use:

The section of code for our algorithm looks like this:

      iter_count = 0
 100  CONTINUE
         local_err = 0.0
         iter_count = iter_count + 1
         CALL exchange(stencil, m, n)
 
         DO j=1, n-2
            DO i=1, m-2
               old_value = stencil(i,j)
 
               stencil(i,j) = ( stencil(i-1, j ) +
     1                          stencil(i+1, j ) +
     2                          stencil( i ,j-1) +
     3                          stencil( i ,j+1) ) / 4
 
               local_err = MAX(local_err,ABS(old_value-stencil(i,j)))
           END DO
        END DO
        CALL MPI_Allreduce(local_err, global_error, 1, MPI_Real,
     1      MPI_Max, MPI_Comm_world, ierror)
 
        IF(MOD(iter_count,100).EQ.0)PRINT *, iter_count, global_error
      IF (close_enough.LT.global_error) GOTO 100
      PRINT *, "convergence reached after", iter_count, "iterations."

Now, let's compile our parallelized version:

$ mpxlf -02 chaotic.f -o chaotic

Next, let's run it and look at the results:

$ export MP_PROCS=4
$ export MP_LABELIO=yes
$ time poe chaotic

Program Name Tasks Wallclock Time Array Size per Task
naive 1 (single processor) 11m1.94s 1000x1000
reordered 1 (single processor) 5m35.38s 1000x1000
chaotic 4 (processors) 2m4.58s 500x500

The previous results show that we more than doubled performance by parallelizing our program. So... we're done, right? Wrong! Since we divided up the work between four processors, we expected our program to execute four times faster. Why doesn't it? This could be due to one of several factors that tend to influence overall performance:

We'll look at some of these factors later. Right now we need to be asking ourselves something more important; does the parallel program get the same answer?

The algorithm we chose gives us a correct answer, but as you will see, it doesn't give us the same answer as our serial version. In practical applications, this may be OK. In fact, it's very common for this to be acceptable in Gauss/Seidel chaotic relaxation. But what if it's not OK? How can we tell? What methods or tools can be used to help us diagnose the problem and find a solution?

Let's take a look...

Wrong answer!

We've now invested all this time and energy in parallelizing our program using message passing, so why can't we get the same answer as the serial version of the program? This is a problem that many people encounter when parallelizing applications from serial code and can be the result of algorithmic differences, program defects, or environment changes.

Both the serial and parallel versions of our program give correct answers based on the problem description, but that doesn't mean they both can't compute different answers! Let's examine the problem more closely by running the chaotic.f program under the pdbx debugger:

$ pdbx chaotic

By looking at the main program, we can see that both versions of our program (reorder.f and chaotic.f) read in the same data file as input. And after we initialize our parallel environment, we can see that the compute_stencil subroutine performs exactly the same step in order to average stencil cells.

Let's run each version under the control of the debugger to view and compare the results of our arrays.

With this test, we will be looking at the upper left quadrant of the entire array. This allows us to compare the array subset on task 0 of the parallel version with the same subset on the serial version.

Here's the serial (reordered) array and parallel (chaotic) array stencils:

Figure 10. Serial and Parallel Array Stencils

View figure.

In chaotic.f, set a breakpoint within the call compute_stencil at line 168.

pdbx(all) stop at 168
all:[0] stop at "chaotic.f":168

Note: After you do this, all tasks should have a breakpoint set at line at 168.

Continue to execute the program up to the breakpoints. The program counter should now be positioned at line 168.

+--------------------------------------------------------------------------------+
|pdbx(all) cont                                                                  |
|   0: initializing the array.                                                   |
|   0: computing the stencil.                                                    |
|   0: 100 1.397277832                                                           |
|   1: 100 1.397277832                                                           |
|   ...                                                                          |
|   ...                                                                          |
|   1:[6] stopped in compute_stencil at line 168 in file "chaotic.f" ($t1)       |
|   1:  168         PRINT *, "convergence reached after", iter_count, "iterations.|
|   2:[6] stopped in compute_stencil at line 168 in file "chaotic.f" ($t1)       |
|   2:  168         PRINT *, "convergence reached after", iter_count, "iterations.|
|   3:[6] stopped in compute_stencil at line 168 in file "chaotic.f" ($t1)       |
|   3:  168         PRINT *, "convergence reached after", iter_count, "iterations.|
|   0:[6] stopped in compute_stencil at line 168 in file "chaotic.f" ($t1)       |
|   0:  168         PRINT *, "convergence reached after", iter_count, "iterations.|
|                                                                                |
+--------------------------------------------------------------------------------+

Next, we'll need to examine the array stencil. Switch the context to task 0, then print the 499th row of the array:

+--------------------------------------------------------------------------------+
|pdbx print stencil(499,1..10)                                                   |
|                                                                                |
|0:(499,1) = 8.00365734                                                          |
|0:(499,2) = 15.9983482                                                          |
|0:(499,3) = 23.9780369                                                          |
|0:(499,4) = 31.9367294                                                          |
|0:(499,5) = 39.8684845                                                          |
|0:(499,6) = 47.7674294                                                          |
|0:(499,7) = 55.6277695                                                          |
|0:(499,8) = 63.4438095                                                          |
|0:(499,9) = 71.2099609                                                          |
|0:(499,10) = 78.9207458      ...                                                |
+--------------------------------------------------------------------------------+

Let's take a close look at the data of each.

Here's the reordered data:

(row, col)
 (499,1)    (499,2)      (499,3)      (499,4)      (499,5)      (499,6)
 8.00365734  15.9983482   23.9780369   31.9367294   39.8684845   47.7674294
 (499,7)    (499,8)      (499,9)      (499,10)     
  55.6277695 63.4438095   71.2099609   78.9207458

Here's the chaotic data:

(row, col)
(499,1)     (499,2)      (499,3)      (499,4)      (499,5)      (499,6)
 8.04555225  16.0820065   24.1032257   32.1031151   40.0756378   48.0148277
(499,7)     (499,8)      (499,9)      (499,10)
 55.9147987  63.7697601   71.5740356   79.3220673
 

After looking at the data, we see that our answers are definitely similar, but different. Why? We can blame it on a couple of things, but it's mostly due to the chaotic nature of our algorithm. By looking at how the average is computed in the serial version of our program, we can see that within each iteration of our loop, two array cells are from the old iteration and two are from new ones.

Figure 11. How the average is computed in a 4-point stencil

View figure.

Another factor is that the north and west borders contain old values at the beginning of each new sweep for all tasks except the northwest corner. The serial version would use new values in each of those quadrants instead of old values. In the parallel version of our program, this is true for the interior array cells but not for our shared boundaries. For more information, you may find In Search of Clusters by Gregory F. Phister, helpful. See Non-IBM publications.

OK, now that we know why we get different answers, is there a fix?

Here's the fix!

So, you've got a serial and parallel program that don't give you the same answers. One way to fix this is to skew the processing of the global array. We skew the processing of the array, computing the upper left process coordinate first, then each successive diagonal to the lower right process coordinate. Each process sends the east and south boundary to its neighboring task.

Figure 12. Sequence of Array Calculation

View figure.

The only thing we need to modify in our new program is the message passing sequence. Prior to the compute_stencil() subroutine, each task receives boundary cells from its north and west neighbors. Each task then sends its east and south boundary cells to its neighbor. This guarantees that the array cells are averaged in the same order as in our serial version.

Here's our modified (skewed) parallel program. It's called skewed.f.

      iter_count = 0
 100  CONTINUE
         local_err = 0.0
         iter_count = iter_count + 1
         CALL exch_in(stencil, m, n)
 
         DO j=1, n-2
            DO i=1, m-2
               old_value = stencil(i,j)
 
               stencil(i,j) = ( stencil(i-1, j ) +
     1                          stencil(i+1, j ) +
     2                          stencil( i ,j-1) +
     3                          stencil( i ,j+1) ) / 4
 
               local_err = MAX(local_err,ABS(old_value-stencil(i,j)))
            END DO
         END DO
 
         CALL exch_out(stencil, m, n)
         CALL MPI_Allreduce(local_err, global_error, 1, MPI_Real,
     1      MPI_Max, MPI_Comm_world, ierror)
 
         IF(MOD(iter_count,100).EQ.0)PRINT *, iter_count, global_error
      IF (close_enough.LT.global_error) GOTO 100
      PRINT *, "convergence reached after", iter_count, "iterations."

Now let's run this new version and look at the results:

$ time poe skewed

Program Name Tasks Wallclock Time Array Size per Task
naive 1 (single processor) 11m1.94s 1000x1000
reordered 1 (single processor) 5m35.38s 1000x1000
chaotic 4 (processors) 2m4.58s 500x500
skewed 4 (processors) 4m41.87s 500x500

If we do the same array comparison again, we can see that we do indeed get the same results. But, of course, nothing's that easy. By correcting the differences in answers, we slowed down execution significantly, so the hidden cost here is time. Hmm... now what do we do?

It's still not fast enough!

We've got the right answers now, but we still want our program to move faster, so we're not done yet. Let's look at our new code to see what other techniques we can use to speed up execution. We'll look at:

One way to further analyze our program is to use the Argonne National Laboratory's Jumpshot tool. Using the PE Benchmarker slogmerge utility, we can generate a SLOG file which we can then load into Jumpshot and use to determine how we can get our program to run faster. We are going to use Jumpshot to determine the effectiveness of the program's message passing characteristics.

Before analyzing your program using Jumpshot, you must link the program with the library which creates the MPI trace files used in the analysis. You do this by setting the MP_UTE environment variable to yes before compiling the program. Assuming you are using ksh, issue the command export MP_UTE=yes before compiling the program. Once you set the environment variable, it remains set for the duration of the login session.

TYPE
pct to start up the Performance Collection Tool graphical user interface. From the Welcome window, you are prompted to either load and start an application or to connect to one that is already running.

SELECT
the Load a new application option and click on OK.

The Load Application window opens and you are prompted to select the application you want to load.

CLICK
the Browse button next to the Executable Name field and select the chaotic program and identify it as an SPMD application.

TYPE
the POE arguments in the POE Arguments field. For example, the following argument specifies that you are running a 4-way parallel job:
-procs 4

CLICK
the Load button to load the application.

The Probe Data Selection window opens.

SELECT
the type of data you want to collect. In our case, we want to select the MPI and user event traces option.

SPECIFY
the directory and basename for your output file. In this scenario, we are using the basename mytrace. Then click on OK.

The main window appears again with the source tree for the skewed executable expanded.

SELECT
Process > Select All Tasks

SELECT
the chaotic.f () subroutine from the source tree.

SELECT
the All MPI events to collect trace information from the Probe Selection area on the side of the main window.

CLICK
the Add button.

SELECT
Application > Start from the menu bar to run the program.

When the application program has finished executing, the Target Application Exited window appears. Click on the OK button to exit PCT.

We have successfully collected data on message passing that now exists in a standard AIX trace file. To view and analyze the data using Jumpshot, we first need to convert the AIX trace file, using the uteconvert utility, into UTE (Unified Trace Environment) interval files. (Jumpshot is a public domain tool developed by Argonne National Laboratories and is not part of the PE Benchmarker Toolset). Then, we can convert the UTE interval files into SLOG files using the slogmerge utility, which Jumpshot can display.

TYPE
uteconvert mytrace

where mytrace is the name of the trace file located in the current directory. mytrace is the prefix of the filename of the trace file. For example, if you had three tasks, the tracefiles would be named mytrace0, mytrace1, and mytrace2. This trace file has the same name as the file you specified for your output earlier in our example. This command will convert the trace file from AIX trace format into the UTE interval file.

Using the -o flag, you can optionally specify the name of the output UTE interval file. For example, to specify that the output file should be named outputfile, type:

uteconvert -o outputfile mytrace

To convert a set of AIX trace files into a set of UTE interval files, specify the number of files using the -n option, and supply the common "base name" prefix shared by all of the files. For example, to convert five trace files with the prefix mytraces into UTE interval files, copy the trace files into a common directory and enter:

uteconvert -n 5 mytraces

TYPE
slogmerge mytrace.ute

where mytrace is the name of your UTE interval file. By default, the slogmerge file output will be trcfile.slog. Using the - o option on the slogmerge command, however, you can specify an output file name. For example:

slogmerge -o mergedtrc.slog mytrace.ute

If you have multiple interval files, use -n to specify the number of files.

TYPE
jumpshot to display the Jumpshot graphical user interface. (We assume that you have already downloaded the Jumpshot program available from Argonne National Laboratories).

SELECT
File > Select Logfile from the menu bar to load the SLOG file.

The window that appears displays the events of your program across a time line. To see detailed load balancing information, continue on with the next step.

SELECT
MPI-Process from the View Options section of the main window to look at the MPI task view and the load imbalance. Then, click on the Display button.

The following picture illustrates the MPI functions occurring during the execution of the skewed program. Each box shown represents an MPI function and the arrows and lines represent communications calls between or within the functions.

Figure 13. Jumpshot

View figure.

By looking at the message passing, we can see some peculiar characteristics of our program. For instance, we notice that many of the processors waste time by waiting for others to complete before they continue. These kinds of characteristics lead us to the conclusion that we've introduced very poor load balancing across tasks. One way to alleviate this problem is to allow some processors to work ahead if they can deduce that another iteration will be necessary in order to find a solution.

If a task's individual max is large enough on one iteration to force the global max to reiterate across the entire array, that task may continue on the next iteration when its west and north boundaries are received.

To illustrate this, we'll use the pipelined.f program.

      iter_count = 0
      local_err = close_enough + 1
 100  CONTINUE
         iter_count = iter_count + 1
         CALL exch_in(stencil, m, n, local_err, global_err,
     1      iter_count, close_enough)
 
         IF (MAX(global_err,local_err).GE.close_enough) THEN
            local_err = 0.0
            DO j=1, n-2
               DO i=1, m-2
                  old_val = stencil(i,j)
 
                  stencil(i,j) = ( stencil( i-1, j ) +
     1                             stencil( i+1, j ) +
     2                             stencil( i ,j-1) +
     3                             stencil( i ,j+1) ) / 4
 
                  local_err = MAX(local_err, ABS(old_val-stencil(i,j)))
               END DO
            END DO
         END IF
 
         CALL exch_out(stencil, m, n, global_err, local_err)
 
         IF(MOD(iter_count,100).EQ.0)PRINT *, iter_count, global_err
     IF (MAX(global_err,local_err).GE.close_enough) GOTO 100
     PRINT *, "convergence reached after", iter_count, "iterations."

As you can see on the following line:

IF(MAX(global_err,local_err).GE.close_enough) THEN

the program checks to see if the value of local_err is enough to allow this task to continue on the next iteration.

Now that we've made these improvements to our program, we'll most likely see improvement in our load balance as well.

Now, let's run our new code to see how this new version fares.

$ time poe pipelined


Program Name Tasks Wallclock Time Array Size per Task
naive 1 (single processor) 11m1.94s 1000x1000
reordered 1 (single processor) 5m35.38ss 1000x1000
chaotic 4 (processors) 2m4.58s 500x500
skewed 4 (processors) 4m41.87s 500x500
pipelined 4 (processors) 2m7.42s 500x500

As you can see, we were able to significantly improve the performance of our program and, at the same time, get a consistent, correct answer. And all we had when we started was our serial program!

We can further analyze the pipelined program's load balance using Jumpshot. The following picture illustrates that the load balance has improved in the pipelined program:

Figure 14. Jumpshot (showing improved load balance)

View figure.


[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]