IBM Books

Hitchhiker's Guide


The program runs but...

Once you've gotten the parallel application running, it would be nice if you were guaranteed that it would run correctly. Unfortunately, this is not the case. In some cases, you may get no output at all, and your challenge is to figure out why not. In other cases, you may get output that's just not correct and, again, you must figure out why it isn't.

The parallel debugger is your friend

An important tool in analyzing your parallel program is the PE parallel debugger (pdbx). In some situations, using the parallel debugger is just like using a debugger for a serial program. In other situations, however, the parallel nature of the problem introduces some subtle and not-so-subtle differences which you should understand in order to use the debugger efficiently. While debugging a serial application, you can focus your attention on the single problem area. In a parallel application, you have to shift your attention between the various parallel tasks and also consider how the interaction among the tasks may be affecting the problem.

The simplest problem

The simplest parallel program to debug is one where all the problems exist in a single task. In this case, you can unhook all the other tasks from the debugger's control and use the parallel debugger as if it were a serial debugger. However, in addition to being the simplest case, it is also the rarest.

The next simplest problem

The next simplest case is one where all the tasks are doing the same thing and they all experience the problem that is being investigated. In this case, you can apply the same debug commands to all the tasks, advance them in lockstep and interrogate the state of each task before proceeding. In this situation, you need to be sure to avoid debugging-introduced deadlocks. These are situations where the debugger is trying to single-step a task past a blocking communication call, but the debugger has not stepped the sender of the message past the point where the message is sent. In these cases, control will not be returned to the debugger until the message is received, but the message will not be sent until control returns to the debugger. Get the picture?

OK, the worst problem

The most difficult situation to debug, and also the most common, is where not all the tasks are doing the same thing and the problem spans two or more tasks. In these situations, you have to be aware of the state of each task, and the interrelations among tasks. You must ensure that blocking communication events either have been or will be satisfied before stepping or continuing through them. This means that the debugger has already executed the send for blocking receives, or the send will occur at the same time (as observed by the debugger) as the receive. Frequently, you may find that tracing back from an error state leads to a message from a task to which you were not paying attention. In these situations, your only choice may be to run the application again and focus on the events leading up to the send.

It core dumps

If your program creates a core dump, POE saves a copy of the core file so you can debug it later. Unless you specify otherwise, POE saves the core file in the coredir.taskid directory, under the current working directory, where taskid is the task number. For example, if your current directory is /u/mickey, and your application creates a core dump (segmentation fault) while running on the node that is task 4, the core file will be located in /u/mickey/coredir.4 on that node.

You can control where POE saves the core file by using the -coredir POE command line option or the MP_COREDIR environment variable.
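As a small illustration of the default naming rule, the sketch below builds the path where a task's core file would be expected. The function name default_core_path is hypothetical (it is not part of POE), the file name core is taken from the directory listings shown later in this section, and the sketch does not model a location changed with -coredir or MP_COREDIR.

```c
#include <stdio.h>

/* Hypothetical helper, not a POE interface: builds the default core file
 * path described above, <cwd>/coredir.<taskid>/core.
 * Returns what snprintf returns: the number of characters the full path
 * needs, excluding the terminating NUL. The result is truncated if buf
 * is too small. */
int default_core_path(char *buf, size_t size, const char *cwd, int taskid)
{
    return snprintf(buf, size, "%s/coredir.%d/core", cwd, taskid);
}
```

For the example above, calling default_core_path with "/u/mickey" and task 4 yields /u/mickey/coredir.4/core.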

Standard AIX corefiles can be large, and the information in them is often at a very low level, which can make them difficult to debug. These large files can also consume too much disk space, CPU time, and network bandwidth. To avoid this problem, PE allows you to produce corefiles in the Ptools Lightweight Corefile Format. Lightweight corefiles provide simple shared stack traces (listings of function calls that led to the error), and consume fewer system resources than traditional corefiles. For more information on lightweight corefiles and how to generate them, see IBM Parallel Environment for AIX: Operation and Use, Vol. 1.

Debugging core dumps

There are two ways you can use traditional AIX core dumps to find problems in your program. After running the program, you can examine the resulting core file to see if you can find the problem. Or, you can try to view your program state by catching it at the point where the problem occurs.

Examining core files

Before you can debug a core file, you first need to get one. In our case, let's just generate it. The example we'll use is an MPI program in which even-numbered tasks pass the answer to the meaning of life to odd-numbered tasks. It's called bad_life.c, and here's what it looks like:

/*******************************************************************
*
* bad_life program
 
* To compile:
* mpcc -g -o bad_life bad_life.c
*
*******************************************************************/
 
#include <stdio.h>
#include <mpi.h>
 
void main(int argc, char *argv[])
{
        int  taskid;
        MPI_Status  stat;
 
        /* Find out number of tasks/nodes. */
        MPI_Init( &argc, &argv);
        MPI_Comm_rank( MPI_COMM_WORLD, &taskid);
 
        if ( (taskid % 2) == 0)
        {
                char *send_message = NULL;
 
                send_message = (char *) malloc(10);
                strcpy(send_message, "Forty Two");
                MPI_Send(send_message, 10, MPI_CHAR, taskid+1, 0,
                        MPI_COMM_WORLD);
                free(send_message);
        } else
        {
                char *recv_message = NULL;
 
                MPI_Recv(recv_message, 10, MPI_CHAR, taskid-1, 0,
                MPI_COMM_WORLD, &stat);
                printf("The answer is  %s\n", recv_message);
                free(recv_message);
        }
                printf("Task %d complete.\n",taskid);
                MPI_Finalize();
                exit(0);
}

We compiled bad_life.c with the following parameters:

$ mpcc -g bad_life.c -o bad_life

and when we run it, we get the following results:

$ export MP_PROCS=4
$ export MP_LABELIO=yes
$ bad_life
  0:Task 0 complete.
  2:Task 2 complete.
ERROR: 0031-250  task 1: Segmentation fault
ERROR: 0031-250  task 3: Segmentation fault
ERROR: 0031-250  task 0: Terminated
ERROR: 0031-250  task 2: Terminated

As you can see, bad_life hits two segmentation faults, which generate two core files. If we list our current directory, we can indeed see two core directories, coredir.1 for task 1 and coredir.3 for task 3, each containing a core file.

$ ls -lR core*
total 88
-rwxr-xr-x   1 hoov     staff       8472 May 02 09:14 bad_life
-rw-r--r--   1 hoov     staff        928 May 02 09:13 bad_life.c
drwxr-xr-x   2 hoov     staff        512 May 02 09:01 coredir.1
drwxr-xr-x   2 hoov     staff        512 May 02 09:36 coredir.3
-rwxr-xr-x   1 hoov     staff       8400 May 02 09:14 good_life
-rw-r--r--   1 hoov     staff        912 May 02 09:13 good_life.c
-rw-r--r--   1 hoov     staff         72 May 02 08:57 host.list
./coredir.1:
total 48
-rw-r--r--   1 hoov     staff      24427 May 02 09:36 core
 
./coredir.3:
total 48
-rw-r--r--   1 hoov     staff      24427 May 02 09:36 core

So, what do we do now? Let's run dbx on one of the core files to see if we can find the problem. You run dbx like this:

$ dbx bad_life coredir.1/core
 
Type 'help' for help.
reading symbolic information ...
[using memory image in coredir.1/core]
 
Segmentation fault in moveeq.memcpy [/usr/lpp/ppe.poe/lib/ip/libmpci.a] at 0xd055b320
0xd055b320 (memcpy+0x10) 7ca01d2a       stsx   r5,r0,r3
(dbx)

Now, let's see where the program crashed and what its state was at that time. If we issue the where command,

(dbx) where

we can see the program stack:

moveeq._moveeq() at 0xd055b320
fmemcpy() at 0xd0568900
cpfromdev() at 0xd056791c
readdatafrompipe(??, ??, ??) at 0xd0558c08
readfrompipe() at 0xd0562564
finishread(??) at 0xd05571bc
kickpipes() at 0xd0556e64
mpci_recv() at 0xd05662cc
_mpi_recv() at 0xd050635c
MPI__Recv() at 0xd0504fe8
main(argc = 1, argv = 0x2ff22c08), line 32 in "bad_life.c"
(dbx)

The output of the where command shows that bad_life.c failed at line 32, so let's look at line 32, like this:

(dbx) func main
(dbx) list 32
 
    32          MPI_Recv(recv_message, 10, MPI_CHAR, taskid-1, 0,
                         MPI_COMM_WORLD, &stat);

When we look at line 32 of bad_life.c, our first guess is that one of the parameters being passed into MPI_Recv is bad. Let's look at some of these parameters to see if we can find the source of the error:

(dbx) print recv_message
(nil)

Ah ha! Our receive buffer has not been initialized and is NULL. The sample programs for this book include a solution called good_life.c. See Accessing PE documentation online for information on how to get the sample programs.
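The good_life.c sample is not reproduced here. As a minimal sketch of the kind of fix the diagnosis points to, the receive side needs real storage behind recv_message before anything is copied into it. To keep the sketch free of MPI, deliver_message is a hypothetical stand-in for the copy that MPI_Recv performs into the caller's buffer (the where output shows the fault inside a memory copy); receive_answer and MSG_LEN are also names invented for illustration.

```c
#include <stdlib.h>
#include <string.h>

#define MSG_LEN 10   /* same count bad_life.c passes to MPI_Send and MPI_Recv */

/* Hypothetical stand-in for the copy MPI_Recv makes into the caller's
 * buffer. The real call faults when dest is NULL; this version reports
 * the problem instead. Returns 0 on success, -1 on a NULL pointer. */
int deliver_message(char *dest, const char *src, size_t count)
{
    if (dest == NULL || src == NULL)
        return -1;
    memcpy(dest, src, count);
    return 0;
}

/* Receive side with the buffer allocated before the "receive".
 * Returns a malloc'ed buffer the caller must free(), or NULL on failure. */
char *receive_answer(const char *incoming)
{
    char *recv_message = malloc(MSG_LEN);   /* the step bad_life.c omits */

    if (recv_message == NULL)
        return NULL;
    if (deliver_message(recv_message, incoming, MSG_LEN) != 0) {
        free(recv_message);
        return NULL;
    }
    return recv_message;
}
```

In the MPI program itself, the equivalent change would be to allocate recv_message (for example with malloc(10)) before the MPI_Recv call; that is an assumption about the fix, not a copy of good_life.c.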

It's important to note that we compiled bad_life.c with the -g compile flag. This gives us all the debugging information we need in order to view the entire program state and to print program variables. In many cases, people don't compile their programs with the -g flag, and they may even turn optimization on (-O). When they do this, there's virtually no information to tell them what happened when their program executed. If this is the case, you can still use dbx to look at only stack information, which allows you to determine the function or subroutine that generated the core dump.

Viewing the program state

If collecting core files is impractical, you can also try catching the program at the segmentation fault. You do this by running the program under the control of the debugger. The debugger gets control of the application at the point of the segmentation fault, and this allows you to view your program state at the point where the problem occurs.

In the following example, we'll use bad_life again, but we'll use pdbx instead of dbx. Load bad_life under pdbx with the following command:

$ pdbx bad_life
 
pdbx Version 3.2  -- Apr 30 2001 15:56:32
 
  0:reading symbolic information ...
  1:reading symbolic information ...
  2:reading symbolic information ...
  3:reading symbolic information ...
  1:[1] stopped in main at line 12
  1:   12       char            *send_message = NULL;
  0:[1] stopped in main at line 12
  0:   12       char            *send_message = NULL;
  3:[1] stopped in main at line 12
  3:   12       char            *send_message = NULL;
  2:[1] stopped in main at line 12
  2:   12       char            *send_message = NULL;
0031-504  Partition loaded ...

Next, let the program run to allow it to reach a segmentation fault.

pdbx(all) cont
 
  0:Task 0 complete.
  2:Task 2 complete.
  3:
  3:Segmentation fault in @moveeq._moveeq [/usr/lpp/ppe.poe/lib/ip/libmpci.a]
  at 0xd036c320
  3:0xd036c320 (memmove+0x10) 7ca01d2a       stsx   r5,r0,r3
  1:
  1:Segmentation fault in @moveeq._moveeq [/usr/lpp/ppe.poe/lib/ip/libmpci.a]
  at 0xd055b320
  1:0xd055b320 (memcpy+0x10) 7ca01d2a       stsx   r5,r0,r3

Once we get segmentation faults, we can focus our attention on one of the tasks that failed. Let's look at task 1:

pdbx(all) on 1

By using the pdbx where command, we can see where the problem originated in our source code:

pdbx(1) where
 
  1:@moveeq.memcpy() at 0xd055b320
  1:fmemcpy() at 0xd0568900
  1:cpfromdev() at 0xd056791c
  1:readdatafrompipe(??, ??, ??) at 0xd0558c08
  1:readfrompipe() at 0xd0562564
  1:finishread(??) at 0xd05571bc
  1:kickpipes() at 0xd0556e50
  1:mpci_recv() at 0xd05662fc
  1:_mpi_recv() at 0xd050635c
  1:MPI__Recv() at 0xd0504fe8
  1:main(argc = 1, argv = 0x2ff22bf0), line 32 in "bad_life.c"

Now, let's move up the stack to function main:

pdbx(1) func main

Next, we'll list line 32, which is where the problem is located:

pdbx(1) l 32
 
  1:   32        MPI_Recv(recv_message, 10, MPI_CHAR, taskid-1, 0,
                 MPI_COMM_WORLD, &stat);

Now that we're at line 32, we'll print the value of recv_message:

 pdbx(1) p recv_message
 
  1:(nil)

As we can see, our program passes a bad parameter to MPI_Recv(): the receive buffer pointer is NULL.

Both the techniques we've talked about so far help you find the location of the problem in your code. The example we used makes it look easy, but in many cases it won't be so simple. However, knowing where the problem occurred is valuable information if you're forced to debug the problem interactively. So, it's worth the time and trouble to figure it out.

Core dumps and threaded programs

If a task of a threaded program produces a core file, the partial dump produced by default does not contain the stack and status information for all threads. Therefore, it is of limited usefulness. You can request AIX to produce a full core file, but such files are generally larger than permitted by user limits (the communication subsystem alone generates more than 64 MB of core information). As a result, you should consider two alternatives:

No output at all

Should there be output?

If you're not getting output from your program and you think you ought to be, make sure you have enabled the program to send data back to you. If the MP_STDOUTMODE environment variable is set to a number, it is the number of the only task for which standard output will be displayed. If that task does not generate standard output, you won't see any.
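The rule above can be sketched as a small check. The function name stdout_displayed is hypothetical, and treating every non-numeric or unset value as "all tasks are displayed" is an assumption of this sketch, since the text only describes the numeric case.

```c
#include <stdlib.h>

/* Returns 1 if standard output from taskid would be displayed under the
 * given MP_STDOUTMODE value, 0 if not.
 * A plain number selects that single task. Any other value, or an unset
 * variable (NULL), is assumed here to mean that all tasks are displayed. */
int stdout_displayed(const char *mode, int taskid)
{
    char *end;
    long only;

    if (mode == NULL || *mode == '\0')
        return 1;

    only = strtol(mode, &end, 10);
    if (*end != '\0')          /* not a plain number, for example a word */
        return 1;

    return only == taskid;
}
```

A program could call it as stdout_displayed(getenv("MP_STDOUTMODE"), taskid) to explain why a given task appears silent.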

There should be output

If MP_STDOUTMODE is set appropriately, the next step is to verify that the program is actually doing something. Start by observing how the program terminates (or fails to terminate). It will do one of two things: terminate on its own, or fail to terminate.

In the first case, you should examine any messages you receive. Since your program is not generating any output, all of the messages will be coming from POE.

In the second case, you will have to stop the program yourself (<Ctrl-c> should work).

One possible reason for lack of output could be that your program is terminating abnormally before it can generate any. POE will report abnormal termination conditions such as being killed, as well as non-zero return codes. Sometimes these messages are obscured in the blur of other errata, so it's important to check the messages carefully.

Figuring out return codes

It's important to understand POE's interpretation of return codes. If the exit code for a task is zero (0) or in the range of 2 to 127, then POE will make that task wait until all tasks have exited. If the exit code is 1 or greater than 128 (or less than 0), then POE will terminate the entire parallel job abruptly (with a SIGTERM signal to each task). In normal program execution, one would expect to have each program go through exit(0) or STOP, and exit with an exit code of 0. However, if a task encounters an error condition (for example, a full file system), then it may exit unexpectedly. In these cases, the exit code is usually set to -1. If, however, you have written error handlers which produce exit codes other than 1 or -1, then POE's termination algorithm may cause your program to hang because one task has terminated abnormally, while the other tasks continue processing (expecting the terminated task to participate).
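The ranges in the paragraph above are easy to misread, so here is a sketch that encodes them exactly as stated. The type and function names are hypothetical and are not part of POE. The value 128 falls outside both stated ranges, so the sketch reports it separately rather than guessing.

```c
/* What POE does with a task's exit code, as described in the text. */
typedef enum {
    POE_WAITS_FOR_ALL,    /* task waits until all tasks have exited      */
    POE_TERMINATES_JOB,   /* whole job is ended, SIGTERM to each task    */
    POE_NOT_STATED        /* 128: not covered by either range in the text */
} poe_action;

/* Hypothetical helper: maps an exit code to the behavior described above. */
poe_action poe_action_for_exit_code(int code)
{
    if (code == 0 || (code >= 2 && code <= 127))
        return POE_WAITS_FOR_ALL;
    if (code == 1 || code > 128 || code < 0)
        return POE_TERMINATES_JOB;
    return POE_NOT_STATED;
}
```

Read this way, an error handler that exits with a code such as 2 leaves the job waiting for the remaining tasks, which is the hang scenario the paragraph warns about, while exiting with 1 brings the whole job down.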

If the POE messages indicate the job was killed (either because of some external situation like low page space or because of POE's interpretation of the return codes), it may be enough information to fix the problem. Otherwise, you may have to do more analysis.

It hangs

If you've gotten this far and the POE messages, and the additional checking by the message passing routines, haven't shed any light on why your program is not generating output, the next step is to figure out whether your program is doing anything at all (besides not giving you output).

Let's look at the following example...it's got a bug in it.

/************************************************************************
*
* Ray trace program with bug
*
* To compile:
* mpcc -g -o rtrace_bug rtrace_bug.c
*
*
* Description:
* This is a sample program that partitions N tasks into
* two groups, a collect node and N - 1 compute nodes.
* The responsibility of the collect node is to collect the data
* generated by the compute nodes. The compute nodes send the
* results of their work to the collect node for collection.
*
* There is a bug in this code.  Please do not fix it in this file!
*
************************************************************************/
 
#include <mpi.h>
 
#define PIXEL_WIDTH 50
#define PIXEL_HEIGHT 50
 
int First_Line = 0;
int Last_Line  = 0;
 
void main(int argc, char *argv[])
{
  int numtask;
  int taskid;
 
  /* Find out number of tasks/nodes. */
  MPI_Init( &argc, &argv);
  MPI_Comm_size( MPI_COMM_WORLD, &numtask);
  MPI_Comm_rank( MPI_COMM_WORLD, &taskid);
 
  /* Task 0 is the coordinator and collects the processed pixels */
  /* All the other tasks process the pixels                      */
  if ( taskid == 0 )
    collect_pixels(taskid, numtask);
  else
    compute_pixels(taskid, numtask);
 
  printf("Task %d waiting to complete.\n", taskid);
  /* Wait for everybody to complete */
  MPI_Barrier(MPI_COMM_WORLD);
  printf("Task %d complete.\n",taskid);
  MPI_Finalize();
  exit();
}
 
/* In a real implementation, this routine would process the pixel */
/* in some manner and send back the processed pixel along with its*/
/* location.  Since we're not processing the pixel, all we do is  */
/* send back the location                                         */
compute_pixels(int taskid, int numtask)
{
  int  section;
  int  row, col;
  int  pixel_data[2];
  MPI_Status stat;
 
  printf("Compute #%d: checking in\n", taskid);
 
  section = PIXEL_HEIGHT / (numtask -1);
 
  First_Line = (taskid - 1) * section;
  Last_Line  = taskid * section;
 
  for (row = First_Line; row < Last_Line; row ++)
    for ( col = 0; col < PIXEL_WIDTH; col ++)
      {
         pixel_data[0] = row;
         pixel_data[1] = col;
         MPI_Send(pixel_data, 2, MPI_INT, 0, 0, MPI_COMM_WORLD);
      }
  printf("Compute #%d: done sending. ", taskid);
  return;
}
 
/* This routine collects the pixels.  In a real implementation, */
/* after receiving the pixel data, the routine would look at the*/
/* location information that came back with the pixel and move  */
/* the pixel into the appropriate place in the working buffer   */
/* Since we aren't doing anything with the pixel data, we don't */
/* bother and each message overwrites the previous one          */
collect_pixels(int taskid, int numtask)
{
  int  pixel_data[2];
  MPI_Status stat;
  int      mx = PIXEL_HEIGHT * PIXEL_WIDTH;
 
  printf("Control #%d: No. of nodes used is %d\n", taskid,numtask);
  printf("Control: expect to receive %d messages\n", mx);
 
  while (mx > 0)
    {
      MPI_Recv(pixel_data, 2, MPI_INT, MPI_ANY_SOURCE,
        MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
      mx--;
    }
  printf("Control node #%d: done receiving. ",taskid);
  return;
}

We took this example from a ray tracing program that distributed a display buffer out to server nodes. The intent is that each task, other than Task 0, takes an equal number of full rows of the display buffer, processes the pixels in those rows, and then sends the updated pixel values back to the client. In the real application, the task would compute the new pixel value and send it as well, but in this example, we're just sending the row and column of the pixel. Because the client is getting the row and column location of each pixel in the message, it doesn't care which server each pixel comes from. The client is Task 0, and the servers are all the other tasks in the parallel job.

This example has a functional bug in it. With a little bit of analysis, the bug is probably easy to spot, and you may be tempted to fix it right away. PLEASE DO NOT!

When you run this program, you get the output shown below. Notice that we're using the -g option when we compile the example. We're cheating a little because we know that there's going to be a problem, so we're compiling with debug information turned on from the start.

$ mpcc -g -o rtrace_bug rtrace_bug.c
$ rtrace_bug -procs 4 -labelio yes
  1:Compute #1: checking in
  0:Control #0: No. of nodes used is 4
  1:Compute #1: done sending. Task 1 waiting to complete.
  2:Compute #2: checking in
  3:Compute #3: checking in
  0:Control: expect to receive 2500 messages
  2:Compute #2: done sending. Task 2 waiting to complete.
  3:Compute #3: done sending. Task 3 waiting to complete.
^C
ERROR: 0031-250  task 1: Interrupt
ERROR: 0031-250  task 2: Interrupt
ERROR: 0031-250  task 3: Interrupt
ERROR: 0031-250  task 0: Interrupt

No matter how long you wait, the program will not terminate until you press <Ctrl-c>.

So, we suspect the program is hanging somewhere. We know it starts executing because we get some messages from it. It could be a logical hang or it could be a communication hang.

Hangs and threaded programs

Coordinating the threads in a task requires careful locking and signaling. Deadlocks that occur because the program is waiting on locks that haven't been released are common, in addition to the deadlock possibilities that arise from improper use of the MPI message passing calls.

Let's attach the debugger

So now that we've come to the conclusion that our program is hanging, let's use the debugger to find out why. The best way to diagnose this problem is to attach the debugger directly to our POE job.

Start up POE and run rtrace_bug:

$ rtrace_bug -procs 4 -labelio yes

To attach the debugger, we first need to get the process id (pid) of the POE job, using the AIX ps command:

$ ps -ef | grep poe
 
smith 24152 20728   0 08:25:22  pts/0  0:00 poe

Next, we'll need to start the pdbx debugger in attach mode by using the -a flag and the process identifier (pid) of the POE job:

$ pdbx -a 24152

Once the debugger starts in attach mode, the pdbx Attach screen appears.

+--------------------------------------------------------------------------------+
|$ pdbx -a 24152                                                                 |
|pdbx Version 3, Release 2 -- April 30 2001 16:46:48                             |
|                                                                                |
|                                                                                |
|To begin debugging in attach mode, select a task or tasks to attach.            |
|                                                                                |
|Task      IP Addr               Node                        PID      Program    |
|0       9.117.243.45       pe05                             15712      rtrace_bu|
|1       9.117.243.29       teamphred2                       35170      rtrace_bu|
|2       9.117.243.45       pe05                             6426       rtrace_bu|
|3       9.117.243.29       teamphred2                       40746      rtrace_bu|
|                                                                                |
|At the pdbx prompt enter the "attach" command followed by a                     |
|list of tasks or "all". (ex. "attach 2 4 5-7" or "attach all")                  |
|You may also type "help" for more information or "quit" to exit                 |
|the debugger without attaching.                                                 |
|                                                                                |
+--------------------------------------------------------------------------------+

The pdbx Attach screen lists the tasks you can choose from and, for each task, shows the task number, IP address, node name, process ID (PID), and program name.

The paging tool used to display the menu defaults to pg -e unless the PAGER environment variable specifies another pager. The debugger displays the list of task numbers that make up the parallel job; it obtains this information by reading a configuration file created by POE when it begins a job step.

After initiating attach mode, select the tasks to which you want to attach. Since we don't know which task or set of tasks is causing the problem, we'll attach to all of the tasks by typing attach all:

+--------------------------------------------------------------------------------+
|pdbx(none) attach all                                                           |
|   0:Waiting to attach to process 15712 ...                                     |
|   0:Successfully attached to rtrace_bug.                                       |
|   2:Waiting to attach to process 6426 ...                                      |
|   2:Successfully attached to rtrace_bug.                                       |
|   3:Waiting to attach to process 40746 ...                                     |
|   3:Successfully attached to rtrace_bug.                                       |
|   1:Waiting to attach to process 35170 ...                                     |
|   1:Successfully attached to rtrace_bug.                                       |
|   0:reading symbolic information ...                                           |
|   0:stopped in read at 0xd01cdd84 ($t1)                                        |
|   0:0xd01cdd84 (read+0x118) 80410014        lwz   r2,0x14(r1)                  |
|   3:reading symbolic information ...                                           |
|   3:stopped in read at 0xd01cdd84 ($t1)                                        |
|   3:0xd01cdd84 (read+0x118) 80410014        lwz   r2,0x14(r1)                  |
|   2:reading symbolic information ...                                           |
|   2:stopped in read at 0xd01cdd84 ($t1)                                        |
|   2:0xd01cdd84 (read+0x118) 80410014        lwz   r2,0x14(r1)                  |
|   1:reading symbolic information ...                                           |
|   1:stopped in read at 0xd01cdd84 ($t1)                                        |
|   1:0xd01cdd84 (read+0x118) 80410014        lwz   r2,0x14(r1)                  |
|0029-2013 Debugger attached and ready.                                          |
+--------------------------------------------------------------------------------+

The debugger attaches to the specified tasks. The selected executables are stopped wherever their program counters happen to be, and are then under the control of the debugger. pdbx displays information about the attached tasks using the task numbering of the original POE application partition.

Let's start by taking a look at task 0. First, we'll change the current context to task 0 by typing on 0. We'll then take a look at the stack trace for task 0 by typing where:

+--------------------------------------------------------------------------------+
|pdbx(attached) on 0                                                             |
|                                                                                |
|pdbx(0) where                                                                   |
|   0:read(??, ??, ??) at 0xd01cdd84                                             |
|   0:readsocket(??) at 0xd2272720                                               |
|   0:kickpipes() at 0xd2266ed4                                                  |
|   0:mpci_recv_gen(??, ??, ??, ??, ??, ??, ??, ??) at 0xd2277e68                |
|   0:mpci_recv(??, ??, ??, ??, ??, ??, ??, ??) at 0xd2280834                    |
|   0:_mpi_recv(??, ??, ??, ??, ??, ??, ??) at 0xd21d6608                        |
|   0:MPI__Recv(??, ??, ??, ??, ??, ??, ??) at 0xd21d5038                        |
|   0:collect_pixels(taskid = 0, numtask = 4), line 101 in "rtrace_bug.c"        |
|   0:main(argc = 1, argv = 0x2ff22990), line 43 in "rtrace_bug.c"               |
|                                                                                |
+--------------------------------------------------------------------------------+

Since our code is hung in low-level routines, let's take a look at the first line from the top of the stack trace that has a line number and a file name associated with it. This indicates that source code association is available. In our case, this is the line which contains collect_pixels, which is 7 lines up the call stack from the entry containing read. To look more closely at the collect_pixels routine, type up 7.

pdbx(0) up 7
   0:collect_pixels(taskid = 0, numtask = 4), line 101 in "rtrace_bug.c"

Now, we can list the source code starting at the calling routine in collect_pixels:

pdbx(0) list    
0:  101         MPI_Recv(pixel_data, 2, MPI_INT, MPI_ANY_SOURCE,    
0:  102           MPI_ANY_TAG, MPI_COMM_WORLD, &stat);    
0:  103         mx--;    
0:  104       }    
0:  105     printf("Control node #%d: done receiving. ",taskid);    
0:  106     return;    
0:  107   }

Now you can see that task 0 is stopped on an MPI_Recv() call. To look at the local data values, type dump.

+--------------------------------------------------------------------------------+
|pdbx(0) dump                                                                    |
|   0:collect_pixels(taskid = 0, numtask = 4), line 101 in "rtrace_bug.c"        |
|   0:stat = (source = 1, tag = 0, error = -777142016, val1 = 8, val2 = 0,       |
|              val3 = 0, val4 = 1, val5 = -559038737)                            |
|   0:mx = 100                                                                   |
|   0:pixel_data = (15, 49)                                                      |
+--------------------------------------------------------------------------------+

When we look at the local data values, we find that mx is still set to 100, so task 0 thinks it's still going to receive 100 messages. Now, let's take a look at what the other tasks are doing. To get the stack information on task 1, type on 1 where.

+--------------------------------------------------------------------------------+
|pdbx(0) on 1 where                                                              |
|   1:read(??, ??, ??) at 0xd01cdd84                                             |
|   1:readsocket(??) at 0xd2d1d720                                               |
|   1:kickpipes() at 0xd2d11ed4                                                  |
|   1:mpci_recv(??, ??, ??, ??, ??, ??, ??, ??) at 0xd2d2b68c                    |
|   1:barrier_shft_b(??) at 0xd2cb1a0c                                           |
|   1:_mpi_barrier(??, ??, ??) at 0xd2cb151c                                     |
|   1:MPI__Barrier(??) at 0xd2cb04e8                                             |
|   1:main(argc = 1, argv = 0x2ff22990), line 49 in "rtrace_bug.c"               |
|                                                                                |
|                                                                                |
+--------------------------------------------------------------------------------+

Task 1 has reached an MPI_Barrier() call. If we quickly check the other tasks, we see that they have all reached this point as well. So ... the problem is solved. Tasks 1 through 3 have completed sending messages but task 0 is still expecting to receive more. Task 0 was expecting 2500 messages but only got 2400, so it is still waiting for 100 messages. Let's see how many messages each of the other tasks is sending. To do this, we'll look at the global variables First_Line and Last_Line.

+--------------------------------------------------------------------------------+
|pdbx(0) on 2 where                                                              |
|   2:read(??, ??, ??) at 0xd01cdd84                                             |
|   2:readsocket(??) at 0xd2272720                                               |
|   2:kickpipes() at 0xd2266ed4                                                  |
|   2:mpci_recv(??, ??, ??, ??, ??, ??, ??, ??) at 0xd228068c                    |
|   2:barrier_shft_b(??) at 0xd21eaa0c                                           |
|   2:_mpi_barrier(??, ??, ??) at 0xd21ea51c                                     |
|   2:MPI__Barrier(??) at 0xd21e94e8                                             |
|   2:main(argc = 1, argv = 0x2ff22990), line 49 in "rtrace_bug.c"               |
|                                                                                |
|pdbx(0) on 3 where                                                              |
|   3:read(??, ??, ??) at 0xd01cdd84                                             |
|   3:readsocket(??) at 0xd2d1d720                                               |
|   3:kickpipes() at 0xd2d11ed4                                                  |
|   3:mpci_recv(??, ??, ??, ??, ??, ??, ??, ??) at 0xd2d2b68c                    |
|   3:barrier_shft_b(??) at 0xd2cb1a0c                                           |
|   3:_mpi_barrier(??, ??, ??) at 0xd2cb151c                                     |
|   3:MPI__Barrier(??) at 0xd2cb04e8                                             |
|   3:main(argc = 1, argv = 0x2ff22990), line 49 in "rtrace_bug.c"               |
+--------------------------------------------------------------------------------+

We can get the values of First_Line and Last_Line for all of the tasks by first changing the context to attached by typing on attached and then using the print command:

+--------------------------------------------------------------------------------+
|pdbx(0) on attached                                                             |
|                                                                                |
|pdbx(attached) print First_Line                                                 |
|   0:0                                                                          |
|   1:0                                                                          |
|   3:32                                                                         |
|   2:16                                                                         |
|                                                                                |
|pdbx(attached) print Last_Line                                                  |
|   0:0                                                                          |
|   1:16                                                                         |
|   3:48                                                                         |
|   2:32                                                                         |
+--------------------------------------------------------------------------------+

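As you can see, task 1 has a First_Line of 0 and a Last_Line of 16, task 2 has 16 and 32, and task 3 has 32 and 48. Task 0, the collector, processes no lines at all.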

So, what happened to lines 48 and 49? Since each row is 50 pixels wide, and we are missing 2 rows, that explains the 100 missing messages. As you have probably already figured out, the division of the total number of lines by the number of tasks is not integral, so we lose part of the result when it is converted back to an integer. Where each task is supposed to be processing 16 and two-thirds lines, it is only handling 16.

Fix the problem

So how do we fix this problem permanently? There is more than one way to proceed. One is to have each task process whole rows so that the shares add up to the full image. In our case, since Task 1 was responsible for 16 and two thirds rows, it would process rows 0 through 16. Task 2 would process 17-33, and Task 3 would process 34-49.

The way we're going to solve it is by creating blocks, with as many rows as there are servers. Each server is responsible for one row in each block (the offset of the row in the block is determined by the server's task number). The fixed code is shown in the following example. Note that this is only part of the program. You can access the entire program from the IBM RS/6000 World Wide Web site. See Accessing PE documentation online for more information.

/************************************************************************
*
* Ray trace program with bug corrected
*
* To compile:
* mpcc -g -o rtrace_good rtrace_good.c
*
*
* Description:
* This is part of a sample program that partitions N tasks into
* two groups, a collect node and N - 1 compute nodes.
* The responsibility of the collect node is to collect the data
* generated by the compute nodes. The compute nodes send the
* results of their work to the collect node for collection.
*
* The bug in the original code was due to the fact that each processing
* task determined the rows to cover by dividing the total number of
* rows by the number of processing tasks.  If that division was not
* integral, the number of pixels processed was less than the number of
* pixels expected by the collection task and that task waited
* indefinitely for more input.
*
* The solution is to allocate the pixels among the processing tasks
* in such a manner as to ensure that all pixels are processed.
*
************************************************************************/
 
/* First_Line, PIXEL_HEIGHT and PIXEL_WIDTH are defined elsewhere in the
   full program. */
void compute_pixels(int taskid, int numtask)
{
  int  offset;
  int  row, col;
  int  pixel_data[2];
  MPI_Status stat;

  printf("Compute #%d: checking in\n", taskid);

  /* Task 0 is the collector, so the first numtask-1 rows are assigned */
  /* to the processing tasks, one row each                             */
  First_Line = (taskid - 1);

  /* Each task skips over the rows processed by the other tasks */
  offset = numtask - 1;

  /* Go through entire pixel buffer, jumping ahead by numtask-1 each time */
  for (row = First_Line; row < PIXEL_HEIGHT; row += offset)
    for (col = 0; col < PIXEL_WIDTH; col++)
      {
        /* One message per pixel, sent to the collector (task 0) */
        pixel_data[0] = row;
        pixel_data[1] = col;
        MPI_Send(pixel_data, 2, MPI_INT, 0, 0, MPI_COMM_WORLD);
      }
  printf("Compute #%d: done sending. ", taskid);
  return;
}

This program is the same as the original one except for the loop in compute_pixels. Now, each task starts at a row determined by its task number and jumps to the next block on each iteration of the loop. The loop is terminated when the task jumps past the last row (which will be at different points when the number of rows is not evenly divisible by the number of servers).

What's the hangup?

The symptom of the problem in the rtrace_bug program was a hang. Hangs can occur for the same reasons they occur in serial programs (in other words, loops without exit conditions). They may also occur because of message passing deadlocks or because of some subtle differences between the parallel and sequential environments.

Analysis with the debugger sometimes indicates that the source of a hang is a message that was never received, even though it is a valid one and appears to have been sent. In these situations, the problem is probably due to lost messages in the communication subsystem. This is especially true if the loss is intermittent or varies from run to run. The fault lies either with the program or with the environment. Before investigating the environment, you should analyze the program's safety with respect to MPI. A safe MPI program is one that does not depend on a particular implementation of MPI. You should also examine the error logs for evidence of repeated message transmissions (which usually indicate a network failure).

Although MPI specifies many details about the interface and behavior of communication calls, it also leaves many implementation details unspecified (and it doesn't just omit them; it specifies that they are unspecified). This means that certain uses of MPI may work correctly in one implementation and fail in another, particularly in the area of how messages are buffered. An application may even work with one set of data and fail with another in the same implementation of MPI. This is because, when the program works, it has stayed within the limits of the implementation. When it fails, it has exceeded the limits. Because the limits are unspecified by MPI, both implementations are valid. MPI safety is discussed further in Chapter 6, Mostly harmless.

Once you have verified that the application is MPI-safe, your only recourse is to blame lost messages on the environment. If the communication path is IP, use the standard network analysis tools to diagnose the problem. Look particularly at mbuf usage. You can examine mbuf usage with the netstat command. Note that the netstat command is not a distributed command which means that it only applies to the node on which you execute it.

$ netstat -m

If the mbuf line shows any failed allocations, you should increase the value of the thewall network option. You can see your current setting with the no command. Like netstat, the no command is not a distributed command, which means that it only applies to the node on which you execute it.

$ no -a

The value presented for thewall is in KBytes. You can use the no command to change this value. You will have to have root access to do this. For example,

$ no -o thewall=16384

sets thewall to 16 MBytes.

Message passing between lots of remote hosts can tax the underlying IP system. Make sure that you look at all the remote nodes, not just the home node. Allow lots of buffers. If the communication path is user space (US), you'll need to get your system support people involved to isolate the problem.

Other hangups

One final cause for no output is a problem on the home node (POE is hung). Normally, a hang is associated with the remote hosts waiting for each other, or for a termination signal. POE running on the home node is alive and well, waiting patiently for some action on the remote hosts. If you type <Ctrl-c> on the POE console, you will be able to successfully interrupt and terminate the set of remote hosts. See IBM Parallel Environment for AIX: Operation and Use, Vol. 1 for information on the poekill command.

There are situations where POE itself can hang. Usually these situations are associated with large volumes of input or output. Remember that POE normally gets standard output from each node. If each task writes a large amount of data to standard output, it may chew up the IP buffers on the machine running POE, causing it (and all the other processes on that machine) to block and hang. The only way to know that this is the problem is by seeing that the rest of the home node has hung. If you think that POE is hung on the home node, your only solution may be to kill POE there. Press <Ctrl-c> several times, or use the command kill -9. At present, there are only partial approaches to avoiding the problem: allocate lots of mbufs on the home node, and don't make the send and receive buffers too large.

Bad output

Bad output includes unexpected error messages (after all, who expects error messages?) and bad results (results that are not correct).

Error messages

You can track down the causes of error messages and correct them in parallel programs using techniques similar to those used for serial programs. One difference, however, is that you need to identify which task is producing the message, if it's not coming from all tasks. You can do this by setting the MP_LABELIO environment variable to yes, or by using the -labelio yes command line parameter. Generally, the message will give you enough information to identify the location of the problem.

You may also want to generate more error and warning messages by setting the MP_EUIDEVELOP environment variable to yes when you first start running a new parallel application. This will give you more information about the things that the message passing library considers errors or unsafe practices.

Bad results

You can track down bad results and correct them in a parallel program in a fashion similar to that used for serial programs. The process, as we saw in the previous debugging exercise, can be more complicated because the processing and control flow on one task may be affected by other tasks. In a serial program, you can follow the exact sequence of instructions that were executed and observe the values of all variables that affect the control flow. However, in a parallel program, both the control flow and the data processing on a task may be affected by messages sent from other tasks. For one thing, you may not have been watching those other tasks. For another, the messages could have been sent a long time ago. Therefore, it's very difficult to correlate a message that you receive with a particular series of events.

Debugging and threads

So far, we've talked about debugging normal old serial or parallel programs, but you may want to debug a threaded program (or a program that uses threaded libraries). If this is the case, there are a few things you should consider.

Before you do anything else, you need to understand the environment in which you're working. You have the potential to create a multi-threaded application, using a multi-threaded library, that consists of multiple distributed tasks. As a result, finding and diagnosing bugs in this environment may require debugging techniques that you're not used to. Here are some things to remember.

When you attach to a running program, all the tasks you selected in your program will be stopped at their current points of execution. Typically, you want to see the current point of execution of your task. This stop point is the position of the program counter, and may be in any one of the many threads that your program may create OR any one of the threads that the MPI library creates. With non-threaded programs, it was adequate to just travel up the program stack until you reached your application code (assuming you compiled your program with the -g option). But with threaded programs, you now need to traverse across other threads to get to your thread(s) and then up the program stack to view the current point of execution of your code.

If you're using the threaded MPI library, the library itself will create a set of threads to process message requests. When you attach to a program that uses the MPI library, all of the threads associated with the POE job are stopped, including the ones created and used by MPI.

It's important to note that to effectively debug your application, you must be aware of how threads are dispatched. When a task is stopped, all threads are also stopped. Each time you issue an execution command, such as step over, step into, step return, or continue, all the threads are released for execution until the next stop (at which time they are stopped, even if they haven't completed their work). This stop may be at a breakpoint you set or the result of a step. A single step over an MPI routine may prevent the MPI library threads from completely processing the message that is being exchanged.

For example, if you wanted to debug the transfer of a message from a sending node to a receiving node, you would step over an MPI_SEND() in your program on task 1, switch to task 2, then step over the MPI_RECV() on task 2. Unless the MPI threads on tasks 1 and 2 have the opportunity to process the message transfer, it will appear that the message was lost. Remember... the window of opportunity for the MPI threads to process the message is brief, and is only open during the step over. The rest of the time, the threads are stopped. Longer-running execution requests, on both the sending and receiving nodes, allow the message to be processed and, eventually, received.

For more information on debugging threaded and non-threaded MPI programs with the PE debugging tool (pdbx), see IBM Parallel Environment for AIX: Operation and Use, Vol. 2, which provides more detailed information on how to manage and display threads.

For more information on the threaded MPI library, see IBM Parallel Environment for AIX: MPI Programming Guide.

