Compiling and Running with MPICH2 and the gdb Debugger

The MPICH2 environment is set up with softenv on tungsten, mercury, and cobalt. It supports a text-mode gdb interface and has been tested at scale [>200 processes]. Debugging with MPICH2 requires recompiling and relinking your code, and because it is installed only for TCP/IP networking, performance will be slower than with the default MPI. MPICH2 may also differ enough from the default MPI that your bug changes behavior or disappears altogether. The softenv keys [for your $HOME/.soft] are shown in the table below, along with the location of a sample batch script for MPICH2:

machine    softenv key(s)                   batch.sample location
tungsten   +mpich2-tcp-1.0.2-intel9         /usr/apps/mpi/mpich2-102/batch.sample
mercury    +intel-c-9.1.043-f-9.1.037-r1    /usr/projects/mpich2/mpi/mpich2-102/batch.sample
           +mpich2-tcp-1.0.2-intel9
cobalt     +mpich2-1.0.2                    /usr/apps/mpi/mpich2-102/batch.sample

Using your favorite editor, add the appropriate key(s) on the line before @default or @teragrid in your $HOME/.soft, then log in again and MPICH2 is ready. Recompile with the "-g" flag to include symbol information in the executable. The compilation commands are the same as with standard MPICH: mpicc, mpicxx, mpif77, and mpif90 for C, C++, Fortran 77, and Fortran 90 respectively. See the batch.sample scripts listed in the table for the steps to run applications within the MPICH2 environment.
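
Adding a key by hand is simple, but the edit can also be scripted. The following is a minimal Bourne-shell sketch, not an NCSA-provided tool: the function name add_soft_key is hypothetical, and appending the key when neither @default nor @teragrid is present is an assumption about the desired behavior.

```shell
# Hypothetical helper (not part of softenv): insert a softenv key on the
# line before the first @default or @teragrid entry in a .soft file.
add_soft_key() {
    key=$1
    file=$2
    # Do nothing if the key is already listed on a line of its own.
    grep -qxF -e "$key" "$file" && return 0
    awk -v key="$key" '
        !done && ($0 == "@default" || $0 == "@teragrid") { print key; done = 1 }
        { print }
        END { if (!done) print key }   # assumption: append if no marker line exists
    ' "$file" > "$file.new" && mv "$file.new" "$file"
}

# Example (commented out so the sketch has no side effects when sourced):
# add_soft_key "+mpich2-1.0.2" "$HOME/.soft"
```

The new keys still take effect only after you log in again.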

Create a $HOME/.mpd.conf if you do not already have one. The mpd daemons require this file.

$ echo "MPD_SECRETWORD=mywordabc" > $HOME/.mpd.conf
$ chmod 600 $HOME/.mpd.conf
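
The two commands above overwrite any existing file and, depending on your umask, may leave it readable by others until the chmod runs. A more cautious sketch follows; the function name make_mpd_conf is hypothetical, and the secret word in the usage comment is only the placeholder from the example above.

```shell
# Hypothetical helper: create an mpd configuration file only if it is missing,
# with owner-only permissions from the moment it is created.
make_mpd_conf() {
    conf=$1
    secret=$2
    [ -e "$conf" ] && return 0          # keep an existing file untouched
    (
        umask 077                       # new files are created owner-only
        printf 'MPD_SECRETWORD=%s\n' "$secret" > "$conf"
    )
    chmod 600 "$conf"
}

# Example (commented out so the sketch has no side effects when sourced):
# make_mpd_conf "$HOME/.mpd.conf" "mywordabc"
```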

Typically, you'll want an interactive batch job to work with mpirun -gdb. The examples below show how to request 16 processors in the debug queues on the systems listed above, along with the startup commands for MPICH2. The commands entered by the user are the text following each shell prompt.

tungsten example

[arnoldg@tunc ~]$ bsub -Is -n16 -W00:30 -qdebug tcsh
WARNING: Project is not specified or invalid, defaulting to aau
Job <522227> is submitted to queue <debug>.
<<Waiting for dispatch ...>>
<<Starting on tune256>>
[arnoldg@tune256 mpich2-102]$ setenv NODES `uniq $LSB_NODEFILE | wc -l`
[arnoldg@tune256 mpich2-102]$ mpdboot -n $NODES -f $LSB_NODEFILE
[arnoldg@tune256 ~]$ mpdtrace -l
tune256_50178
tune249_33578
tune250_38301
tune252_33756
tune253_33039
tune251_52664
tune255_42282
tune254_38568
[arnoldg@tune256 ~/mpi]$ mpirun -gdb -np 8 hello_world_mpich2
0-7: (gdb) run
0-7: Continuing.
0: Hello world! I'm 0 of 8 on tune256
1: Hello world! I'm 1 of 8 on tune249
2: Hello world! I'm 2 of 8 on tune250
3: Hello world! I'm 3 of 8 on tune252
4: Hello world! I'm 4 of 8 on tune253
5: Hello world! I'm 5 of 8 on tune251
6: Hello world! I'm 6 of 8 on tune255
7: Hello world! I'm 7 of 8 on tune254
0: [0] 96 at [0x080dfc70], id = 1 dbginit.c[89]
0-7:
0,2-7: Program exited normally.
0,2-7: (gdb) 1: Program received signal SIGSEGV, Segmentation fault.
1: 0x08049311 in main (argc=1, argv=0xbfffe154) at hello_world_tv.c:33
1: 33 *f=3.5;
1: (gdb)
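
The setenv NODES line in the session above is tcsh syntax, and it relies on duplicate host names being adjacent in the node file, since uniq only collapses adjacent lines. A Bourne-shell sketch that does not depend on ordering is shown below; count_nodes is a hypothetical name, and the commented mpdboot line simply mirrors the options used above.

```shell
# Hypothetical helper: count the distinct hosts listed in a scheduler node file.
# sort -u removes duplicates even when they are not on adjacent lines.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '    # tr strips padding some wc versions add
}

# Example for an LSF job (commented out; requires a running batch job):
# NODES=$(count_nodes "$LSB_NODEFILE")
# mpdboot -n "$NODES" -f "$LSB_NODEFILE"
```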

mercury example [also applicable to mvapich2 on abe]

ncsa/arnoldg> qsub -I -V -lnodes=8:ppn=2,walltime=00:30:00 -qdebug
This job will be charged to project: TG-STA040012
qsub: waiting for job 476910.tg-master.ncsa.teragrid.org to start
qsub: job 476910.tg-master.ncsa.teragrid.org ready

----------------------------------------
Begin PBS Prologue Wed Oct 19 09:59:08 CDT 2005
Job ID: 476910.tg-master.ncsa.teragrid.org
Username: arnoldg
Group: afw
Nodes: tg-c273 tg-c274 tg-c275 tg-c276 tg-c277 tg-c882 tg-c883 tg-cs04
End PBS Prologue Wed Oct 19 09:59:14 CDT 2005
----------------------------------------
Directory: /home/ncsa/arnoldg
Wed Oct 19 09:59:14 CDT 2005
ncsa/arnoldg> setenv NODES `uniq $PBS_NODEFILE | wc -l`
ncsa/arnoldg> mpdboot -n $NODES -f $PBS_NODEFILE
ncsa/arnoldg> mpdtrace -l
tg-c277_59996
tg-c882_50164
tg-cs04_33091
tg-c883_48442
tg-c273_36797
tg-c274_50208
tg-c275_49110
tg-c276_48408
ncsa/arnoldg>
arnoldg/mpi> mpirun -gdb -np 16 hello_world_tv
0-15: (gdb) run
0-15: Continuing.
0: Hello world! I'm 0 of 16 on tg-c277
1: Hello world! I'm 1 of 16 on tg-c882
2: Hello world! I'm 2 of 16 on tg-cs04
3: Hello world! I'm 3 of 16 on tg-c883
4: Hello world! I'm 4 of 16 on tg-c273
5: Hello world! I'm 5 of 16 on tg-c274
6: Hello world! I'm 6 of 16 on tg-c275
7: Hello world! I'm 7 of 16 on tg-c276
8: Hello world! I'm 8 of 16 on tg-c277
9: Hello world! I'm 9 of 16 on tg-c882
10: Hello world! I'm 10 of 16 on tg-cs04
11: Hello world! I'm 11 of 16 on tg-c883
12: Hello world! I'm 12 of 16 on tg-c273
13: Hello world! I'm 13 of 16 on tg-c274
14: Hello world! I'm 14 of 16 on tg-c275
15: Hello world! I'm 15 of 16 on tg-c276
0-15:
1: Program received signal SIGSEGV, Segmentation fault.
0,2-15: Program exited normally.
0,2-15: (gdb) 1: 0x4000000000002631 in main (argc=1, argv=0x60000fffffffa298)
1: at hello_world_tv.c:33
1: 33 *f=3.5;
1: (gdb)

cobalt example

[arnoldg@co-login1 ~]$ qsub -I -V -lncpus=16,mem=2gb,walltime=00:30:00 -qdebug
qsub: requesting a CPU Memory set with 16 nodes
This job will be charged to project: aau
qsub: waiting for job 30915.co-master1 to start
qsub: job 30915.co-master1 ready

----------------------------------------
!Begin PBS Prologue Wed Oct 19 09:53:01 CDT 2005
Job ID: 30915
Username: arnoldg
Group: aau
Creating Batch Directory 30915 in /scratch/batch
----------------------------------------

set_SCR: using existing PBS job directory /scratch/batch/30915
[arnoldg@co-login1 ~]$ mpd &
[1] 5866
[arnoldg@co-login1 ~]$ mpdtrace -l
co-login1.ncsa.uiuc.edu_40365
[arnoldg@co-login1 ~]$ mpirun -gdb -np 16 hello_world_mpich2
0-15: (gdb) run
0-15: Continuing.
0: Hello world! I'm 0 of 16 on co-login1.ncsa.uiuc.edu
1: Hello world! I'm 1 of 16 on co-login1.ncsa.uiuc.edu
2: Hello world! I'm 2 of 16 on co-login1.ncsa.uiuc.edu
3: Hello world! I'm 3 of 16 on co-login1.ncsa.uiuc.edu
4: Hello world! I'm 4 of 16 on co-login1.ncsa.uiuc.edu
5: Hello world! I'm 5 of 16 on co-login1.ncsa.uiuc.edu
6: Hello world! I'm 6 of 16 on co-login1.ncsa.uiuc.edu
7: Hello world! I'm 7 of 16 on co-login1.ncsa.uiuc.edu
8: Hello world! I'm 8 of 16 on co-login1.ncsa.uiuc.edu
9: Hello world! I'm 9 of 16 on co-login1.ncsa.uiuc.edu
10: Hello world! I'm 10 of 16 on co-login1.ncsa.uiuc.edu
11: Hello world! I'm 11 of 16 on co-login1.ncsa.uiuc.edu
12: Hello world! I'm 12 of 16 on co-login1.ncsa.uiuc.edu
13: Hello world! I'm 13 of 16 on co-login1.ncsa.uiuc.edu
14: Hello world! I'm 14 of 16 on co-login1.ncsa.uiuc.edu
15: Hello world! I'm 15 of 16 on co-login1.ncsa.uiuc.edu
0-15:
1: Program received signal SIGSEGV, Segmentation fault.
0,2-15: Program exited normally.
0,2-15: (gdb) 1: 0x4000000000002471 in main (argc=1, argv=0x60000fffffff8c58)
1: at hello_world_tv.c:33
1: 33 *f=3.5;
1: (gdb)

Note: The IBM pdbx parallel debugger provides functionality similar to the -gdb option of mpirun in MPICH2. There is a good tutorial and guide for pdbx parallel debugging that includes a section of IBM parallel debugging tips. Many of the concepts and techniques from that document can be applied to mpirun -gdb.

The example sessions below show mpirun with the -gdb flag isolating a pointer bug associated with a single MPI rank. Note how each line of output is labeled with the rank or ranks that produced it, and how the "z" command changes the rank focus for gdb. Use "z" to change gdb focus to a single rank [z N], a comma-separated set of ranks [z M,N], or a range of ranks [z M-N].

Example showing change of gdb rank focus:

[arnoldg@co-login1 ~/mpi]$ mpirun -gdb -np 4 hello_world_mpich2
0-3:  (gdb) z 2-3
2-3:  (gdb) z 1,2
1,2:  (gdb) z 4
4:  (gdb)
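
The rank labels that prefix each output line use the same notation as the argument to z. As a reading aid, the sketch below expands such a specification into individual rank numbers; expand_ranks is a hypothetical helper, not part of MPICH2.

```shell
# Hypothetical helper: expand a rank specification such as "0,2-7"
# into a space-separated list of individual ranks.
expand_ranks() {
    out=""
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        case $part in
            *-*)                         # a range M-N
                i=${part%-*}
                last=${part#*-}
                while [ "$i" -le "$last" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done
                ;;
            *)                           # a single rank
                out="$out $part"
                ;;
        esac
    done
    printf '%s\n' "${out# }"
}

expand_ranks "0,2-7"    # prints: 0 2 3 4 5 6 7
```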

In the example sessions that follow, user input is the text after each shell or (gdb) prompt.

mpirun -gdb example session [segmentation fault]

[arnoldg@tund170 ~/mpi]$ mpirun -gdb -np 8 hello_world_mpich2
0-7: (gdb) run
0-7: Continuing.
0: Hello world! I'm 0 of 8 on tund170
1: Hello world! I'm 1 of 8 on tund053
2: Hello world! I'm 2 of 8 on tund069
3: Hello world! I'm 3 of 8 on tund169
4: Hello world! I'm 4 of 8 on tund170
5: Hello world! I'm 5 of 8 on tund053
6: Hello world! I'm 6 of 8 on tund069
7: Hello world! I'm 7 of 8 on tund169
0: [0] 96 at [0x080dfc70], id = 1 dbginit.c[89]
0-7:
0,2-7: Program exited normally.
0,2-7: (gdb) 1: Program received signal SIGSEGV, Segmentation fault.
1: 0x08049311 in main (argc=1, argv=0xbfffe494) at hello_world_tv.c:33
1: 33 *f=3.5;
1: (gdb) list
0,2-7: 16
1: 28
0,2-7: 17 {
0,2-7: 18 int rank, size, len;
0,2-7: 19 char name[MPI_MAX_PROCESSOR_NAME];
0,2-7: 20
0,2-7: 21 MPI_Init(&argc, &argv);
0,2-7: 22 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
0,2-7: 23 MPI_Comm_size(MPI_COMM_WORLD, &size);
1: 29 if ( (rank ==1) || (rank == 31) )
0,2-7: 24
0,2-7: 25 MPI_Get_processor_name(name, &len);
0,2-7: (gdb) 1: 30 {
1: 31 double *f;
1: 32 f=0;
1: 33 *f=3.5;
1: 34 }
1: 35 MPI_Finalize();
1: 36 exit(0);
1: 37 }
1: (gdb) z 2-7
2-7: (gdb) where
2-7: No stack.
2-7: (gdb) list
2-7: 26
2-7: 27 printf ("Hello world! I'm %d of %d on %s\n", rank, size, name);
2-7: 28
2-7: 29 if ( (rank ==1) || (rank == 31) )
2-7: 30 {
2-7: 31 double *f;
2-7: 32 f=0;
2-7: 33 *f=3.5;
2-7: 34 }
2-7: 35 MPI_Finalize();
2-7: (gdb) quit
rank 1 in job 1 tund170_40949 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
0,5-7: MPIGDB ENDING
[arnoldg@tund170 ~/mpi]$

MPI programs occasionally hang due to logic errors within a program.  Consider the consequences of the following code example:

        if ( (rank ==1) || (rank == 31) )
        {
                char buf[255];
                MPI_Request request;
                MPI_Status status;
                MPI_Recv( (void *)buf, 1, MPI_CHAR, 1, 55, MPI_COMM_WORLD,&status);
                MPI_Send( (void *)buf, 1, MPI_CHAR, 1, 55, MPI_COMM_WORLD);
        }

This code has a couple of errors, and the blocking MPI_Recv causes it to hang on rank 1 or rank 31: rank 1 waits for a message from itself that is never sent, because its own MPI_Send comes only after the receive returns. That is the observed runtime behavior. MPICH2 and gdb can help with this sort of problem. Notice how rank 1 disappears from the gdb labeling in the example below, indicating a problem with rank 1 [it is hanging]:

mpirun -gdb example session [program hangs]
[arnoldg@co-login1 ~/mpi]$ mpirun -gdb -np 4 hello_hang
0-3:  (gdb) run
0-3:  Continuing.
0:  Hello world! I'm 0 of 4 on co-login1.ncsa.uiuc.edu
1:  Hello world! I'm 1 of 4 on co-login1.ncsa.uiuc.edu
2:  Hello world! I'm 2 of 4 on co-login1.ncsa.uiuc.edu
3:  Hello world! I'm 3 of 4 on co-login1.ncsa.uiuc.edu
0,2-3:
0,2-3:  Program exited normally.
0,2-3:  (gdb)

The gdb session hangs at this point, and rank 1 has dropped out of the display. In another terminal window, look for the remaining running process [this is easy on a shared-memory system like cobalt; in a cluster environment, the compute node running rank 1 would have to be found and checked with ps]. Then send it the SEGV signal with kill to interrupt it:

[arnoldg@co-login1 ~/mpi]$ ps auxw|grep hello_hang
arnoldg  30384  2.4  0.0 61184 22256 pts/46  SN   12:47   0:00 gdb -q hello_hang
arnoldg  30385  2.4  0.0 61184 22256 pts/46  SN   12:47   0:00 gdb -q hello_hang
arnoldg  30386  2.5  0.0 61184 22256 pts/46  SN   12:47   0:00 gdb -q hello_hang
arnoldg  30387  2.5  0.0 61184 22256 pts/46  SN   12:47   0:00 gdb -q hello_hang
arnoldg  30389  0.5  0.0  5104 2096 pts/46   SN   12:47   0:00 /u/ncsa/arnoldg/mpi/hello_hang
[arnoldg@co-login1 ~/mpi]$ kill -SEGV 30389
[arnoldg@co-login1 ~/mpi]$
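
With many gdb wrapper processes in the listing, the one application process still running can be picked out mechanically. The sketch below filters ps output for processes whose command basename matches the program name, which skips the "gdb -q hello_hang" entries. The name hung_pids is hypothetical, and the filter assumes the ps auxw column layout shown above, with the PID in column 2 and the command in column 11.

```shell
# Hypothetical helper: read `ps auxw` output on stdin and print the PIDs of
# processes whose command basename equals the given program name.
hung_pids() {
    awk -v prog="$1" '{
        n = split($11, path, "/")       # column 11 holds the command path
        if (path[n] == prog) print $2   # column 2 holds the PID
    }'
}

# Example (commented out; inspect the PID before sending any signal):
# ps auxw | hung_pids hello_hang
```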

Immediately after sending the SEGV signal, the gdb session catches the signal and returns a prompt so that debugging may continue.  The source code line causing the hang is noted in the output from the where command:
          1:
1:  Program received signal SIGSEGV, Segmentation fault.
1:  __poll (fds=0x6000000000044a10, nfds=1, timeout=-1)
1:      at ../sysdeps/unix/sysv/linux/poll.c:82
1:  82  ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
1:      in ../sysdeps/unix/sysv/linux/poll.c
1:  (gdb) where
0,2-3:  No stack.
1:  #0  __poll (fds=0x6000000000044a10, nfds=1, timeout=-1)
0,2-3:  (gdb) 1:      at ../sysdeps/unix/sysv/linux/poll.c:82
1:  #1  0x400000000006b130 in MPIDU_Sock_wait ()
1:  #2  0x4000000000012920 in MPIDI_CH3_Progress_wait ()
1:  #3  0x400000000000e6f0 in PMPI_Recv ()
1:  #4  0x40000000000024c0 in main (argc=1, argv=0x60000fffffff8b98)
1:      at hello_hang.c:34
1:  (gdb) z 1
1:  (gdb) list
1:  34                  MPI_Recv( (void *)buf, 1, MPI_CHAR, 1, 55, MPI_COMM_WORLD,&status);
1:  35                  MPI_Send( (void *)buf, 1, MPI_CHAR, 1, 55, MPI_COMM_WORLD);
1:  36          }
1:  37          MPI_Finalize();
1:  38          exit(0);
1:  39  }
1:  (gdb) quit
 rank 1 in job 403  co-login1.ncsa.uiuc.edu_49348   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
0,3:  MPIGDB ENDING
[arnoldg@co-login1 ~/mpi]$
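
In the session above, the first clue was that rank 1 was absent from the 0,2-3 label on the (gdb) prompt. For larger runs, where a gap in a label is harder to spot by eye, the check can be sketched as below; missing_ranks is a hypothetical helper that takes the process count given to -np and a label copied from the prompt.

```shell
# Hypothetical helper: given the process count and a gdb rank label such as
# "0,2-3", print the ranks in 0..np-1 that the label does not cover.
missing_ranks() {
    awk -v np="$1" -v label="$2" 'BEGIN {
        n = split(label, parts, ",")
        for (k = 1; k <= n; k++) {
            m = split(parts[k], ends, "-")
            lo = ends[1] + 0
            hi = (m == 2) ? ends[2] + 0 : lo
            for (r = lo; r <= hi; r++) seen[r] = 1
        }
        out = ""
        for (r = 0; r < np; r++)
            if (!(r in seen)) out = out (out == "" ? "" : " ") r
        print out
    }'
}

missing_ranks 4 "0,2-3"    # prints: 1
```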

References