Introduction.
Tungsten MPI example.
Cobalt serial example.
What about the IBM p690 [copper]?
The Electric Fence malloc debugger is installed on the login hosts of
the NCSA tungsten, mercury, and cobalt clusters. It can be used with
serial or MPI programs that use the C memory allocation routines
[free(), malloc(), ...] and pointers. Since C memory allocation is a
frequent source of program bugs, it's a good idea to try Electric Fence
when programs exhibit sporadic errors. To get started with Electric
Fence, see the efence man page:
[tunb ~/c]$ man efence
efence(3)                                                    efence(3)
NAME
efence - Electric Fence Malloc Debugger
SYNOPSIS
#include <stdlib.h>
...
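For orientation, the efence man page describes the underlying idea: each
allocation is placed so that an inaccessible page of virtual memory sits
directly after it, so an overrun faults at the offending instruction
instead of silently corrupting the heap. The following is a minimal
sketch of that idea, not Electric Fence's own code. The names
guarded_alloc() and guarded_free() are hypothetical, the sketch assumes a
POSIX system with mmap() and mprotect(), and it ignores the alignment
handling that the real library performs.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on strict compilers */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Number of whole pages needed to hold `size` bytes (at least one). */
static size_t data_pages_for(size_t size, size_t page)
{
    size_t n = (size + page - 1) / page;
    return n ? n : 1;
}

/* Hypothetical helper: return `size` bytes whose last byte lies directly
 * below a page with no access rights. Returns NULL on failure.
 * Assumption: the caller does not need an aligned pointer. */
void *guarded_alloc(size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = data_pages_for(size, page);
    size_t total = (npages + 1) * page;      /* data pages + one guard page */

    unsigned char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    unsigned char *guard = base + npages * page;
    if (mprotect(guard, page, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }
    return guard - size;                     /* buffer ends where guard begins */
}

/* Release a block obtained from guarded_alloc(); `size` must match. */
void guarded_free(void *ptr, size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = data_pages_for(size, page);
    unsigned char *guard = (unsigned char *)ptr + size;

    munmap(guard - npages * page, (npages + 1) * page);
}
```

With a buffer from guarded_alloc(1), writing a second byte lands on the
guard page and the process receives a segmentation fault at that
instruction, which is the behavior the examples on this page rely on.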
This example was run on the tungsten cluster. Electric Fence doesn't
work with the Myrinet/GM malloc(), so the code was compiled with a
version of mpich-tcp available via softenv:
[tunb ~/c]$ cat $HOME/.soft
@default
+mpich-tcp-1.2.5.2-intel8
[tunb ~/c]$ cat $HOME/.cshrc
limit coredumpsize 100000
[tunb ~/c]$ cat hello_world.c
#include <stdio.h>
#include <mpi.h>
main(argc, argv)
int argc;
char *argv[];
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    double *badptr;
    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    MPI_Barrier(MPI_COMM_WORLD);
    system("/bin/uname -a");
    printf ("Hello world! I'm %d of %d on %s\n", rank, size, name);
    if (rank == 1)
    {
        free(badptr);
    }
    MPI_Finalize();
    exit(0);
}
Note that the program typically compiles and runs without showing an
error:
[tunb ~/c]$ mpicc -g -o hello_world hello_world.c
[tunb ~/c]$ cat hosts
tunb
tunb
[tunb ~/c]$ mpirun -np 2 -machinefile hosts hello_world
Linux tunb 2.4.20-31.9smp_perfctr_lustre #2 SMP Thu Jun 24 21:02:14 CDT 2004 i686 i686 i386 GNU/Linux
Linux tunb 2.4.20-31.9smp_perfctr_lustre #2 SMP Thu Jun 24 21:02:14 CDT 2004 i686 i686 i386 GNU/Linux
Hello world! I'm 0 of 2 on tunb.ncsa.uiuc.edu
Hello world! I'm 1 of 2 on tunb.ncsa.uiuc.edu
That is misleading, though, since calling free() with an uninitialized
pointer is an error. Electric Fence can usually catch such mistakes.
Re-link with -lefence and see what happens [note: linking with -lefence
slows down the resulting program, so drop -lefence from the final link
after debugging]:
[tunb ~/c]$ mpicc -g -o hello_world hello_world.c -lefence
[tunb ~/c]$ mpirun -np 2 -machinefile hosts hello_world
Electric Fence 2.2.0
Copyright (C) 1987-1999 Bruce Perens <bruce@perens.com>
Electric Fence 2.2.0
Copyright (C) 1987-1999 Bruce Perens <bruce@perens.com>
Linux tunb 2.4.20-31.9smp_perfctr_lustre #2 SMP Thu Jun 24 21:02:14 CDT 2004 i686 i686 i386 GNU/Linux
Hello world! I'm 0 of 2 on tunb.ncsa.uiuc.edu
Linux tunb 2.4.20-31.9smp_perfctr_lustre #2 SMP Thu Jun 24 21:02:14 CDT 2004 i686 i686 i386 GNU/Linux
ElectricFence Aborting: free(40016148): address not from malloc().
Illegal instruction (core dumped)
Killed by signal 2.
The gdb debugger can throw a little more light upon that error
message from Electric Fence:
[tunb ~/c]$ gdb hello_world core.9073
GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...
Core was generated by `/u/ncsa/arnoldg/c/hello_world tunb 39600 4amslave -p4yourname tunb -p4rmrank'.
Program terminated with signal 4, Illegal instruction.
Reading symbols from /usr/lib/libefence.so.0...done.
Loaded symbols for /usr/lib/libefence.so.0
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /usr/local/pgi/linux86/5.2/lib/libpgc.so...done.
Loaded symbols for /usr/local/pgi/linux86/5.2/lib/libpgc.so
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
#0  0xffffe002 in ?? ()
(gdb) where
#0  0xffffe002 in ?? ()
#1  0x4002d346 in EF_Abort () from /usr/lib/libefence.so.0
#2  0x4002cc1e in free () from /usr/lib/libefence.so.0
#3  0x08049ffa in main (argc=1, argv=0x40112f84) at hello_world.c:40
#4  0x420156a4 in __libc_start_main () from /lib/tls/libc.so.6
(gdb) l
40              free(badptr);
41          }
42
43
44          MPI_Finalize();
45          exit(0);
46      }
(gdb) q
[tunb ~/c]$
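The traceback points to the free(badptr) call at hello_world.c:40, where
badptr was never assigned a value. One common defensive habit, sketched
here under the assumption that a single routine owns the pointer, is to
initialize pointers to NULL and reset them after freeing, since
free(NULL) is defined to do nothing. The helper name release_buffer() is
hypothetical and is not part of the original program.

```c
#include <stdlib.h>

/* Hypothetical helper: free the buffer that *pp refers to, then reset
 * *pp to NULL so that a repeated call is harmless. */
void release_buffer(double **pp)
{
    if (pp != NULL) {
        free(*pp);      /* free(NULL) is a no-op, so NULL is a safe value */
        *pp = NULL;
    }
}

/* Sketch of the calling pattern: returns 0 on success, -1 if the
 * allocation failed. */
int use_buffer(size_t n)
{
    double *buf = NULL;             /* never leave the pointer unassigned */
    size_t i;

    buf = malloc(n * sizeof *buf);
    if (buf == NULL)
        return -1;

    for (i = 0; i < n; i++)
        buf[i] = 0.0;

    release_buffer(&buf);           /* buf is NULL again after this call */
    return 0;
}
```

In the MPI example, the same pattern would mean declaring
double *badptr = NULL; so that the free() on rank 1 becomes a harmless
call rather than an error.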
This is a simple serial program with a bug
diagnosed using Electric Fence on cobalt.
[co-login1 ~/c]$ cat efence_test.c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int main(void)
{
    char *myptr;
    printf("begin\n");
    myptr= malloc(1);
    strcpy(myptr, "end done");
    printf("%s\n", myptr);
}
Once again, the program appears to compile and run without incident.
[co-login1 ~/c]$ icc -g -o efence_test efence_test.c
[co-login1 ~/c]$ ./efence_test
begin
end done
Calling the program with the ef utility wraps the library selection
[the ef script sets LD_PRELOAD] so that the Electric Fence library is
consulted first for the C allocation routines. This has the same effect
as linking with -lefence without requiring any change to the program.
Gdb may be used to get more information about the bug.
[co-login1 ~/c]$ limit coredumpsize 100000
[co-login1 ~/c]$ rm -f core*
[co-login1 ~/c]$ ef ./efence_test
Electric Fence 2.2.0
Copyright (C) 1987-1999 Bruce Perens <bruce@perens.com>
begin
/usr/bin/ef: line 20: 32183 Segmentation fault (core dumped) ( export LD_PRELOAD=libefence.so.0.0; exec $* )
[co-login1 ~/c]$ gdb efence_test core*
GNU gdb Red Hat Linux (5.2-2)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "ia64-redhat-linux"...
Core was generated by `'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libefence.so.0.0...done.
Loaded symbols for /usr/lib/libefence.so.0.0
Reading symbols from /lib/libm.so.6.1...done.
Loaded symbols for /lib/libm.so.6.1
Reading symbols from /usr/local/intel/8.0.069/lib/libcprts.so.6...done.
Loaded symbols for /usr/local/intel/8.0.069/lib/libcprts.so.6
Reading symbols from /usr/local/intel/8.0.069/lib/libcxa.so.6...done.
Loaded symbols for /usr/local/intel/8.0.069/lib/libcxa.so.6
Reading symbols from /usr/local/intel/8.0.069/lib/libunwind.so.6...done.
Loaded symbols for /usr/local/intel/8.0.069/lib/libunwind.so.6
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libc.so.6.1...done.
Loaded symbols for /lib/libc.so.6.1
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/ld-linux-ia64.so.2...done.
Loaded symbols for /lib/ld-linux-ia64.so.2
#0  0x20000000004056e0 in strcpy () at soinit.c:56
56      soinit.c: No such file or directory.
        in soinit.c
(gdb) where
#0  0x20000000004056e0 in strcpy () at soinit.c:56
#1  0x4000000000001270 in main () at efence_test.c:16
(gdb) l
51      in soinit.c
(gdb) l efence_test.c:16
11      */
12
13          myptr= malloc(80 * sizeof(char) );
14          myptr= malloc(1);
15
16          strcpy(myptr, "end done");
17          printf("%s\n", myptr);
18      }
(gdb) q
[co-login1 ~/c]$
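The backtrace shows strcpy() faulting at efence_test.c:16 because the
one-byte buffer cannot hold the string plus its terminating NUL. A
minimal sketch of a fix, assuming the intent was simply to hold a copy
of the string, sizes the allocation from the source string itself. The
helper name copy_string() is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated copy of src, or NULL if
 * the allocation fails. The caller is responsible for free(). */
char *copy_string(const char *src)
{
    size_t len = strlen(src) + 1;   /* +1 for the terminating NUL byte */
    char *dst = malloc(len);

    if (dst != NULL)
        memcpy(dst, src, len);      /* copies the NUL as well */
    return dst;
}
```

Replacing the malloc(1) and strcpy() pair with a call such as
myptr = copy_string("end done"); should let the program run cleanly
under ef as well as on its own.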
For the IBM p690 cluster, there's a useful compile-time option
[ -qheapdebug ] that can help with memory allocation bugs. It's similar
to Electric Fence. Here's a variant of the first example showing the
use of -qheapdebug:
Cu12:~/c115% cat ef_mpi.c
#include <stdio.h>
#include <mpi.h>
main(argc, argv)
int argc;
char *argv[];
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    char *badptr;
    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    MPI_Barrier(MPI_COMM_WORLD);
    system("/bin/uname -a");
    printf ("Hello world! I'm %d of %d on %s\n", rank, size, name);
    if (rank == 1)
    {
        badptr=(char *) malloc(1 * sizeof(char));
        strcat(badptr,"this is a test string .");
    }
    MPI_Finalize();
    exit(0);
}
Cu12:~/c116% mpcc_r -qheapdebug -g -o ef_mpi ef_mpi.c
Note that you can run MPI poe applications with or without the poe
command because the poe libraries are compiled into the program.
Cu12:~/c117% ./ef_mpi -procs 2
AIX Cu12 1 5 0024BB0A4C00
Hello world! I'm 1 of 2 on Cu12
1546-504 Internal storage object was overwritten at 0x2458E5AB.
AIX Cu12 1 5 0024BB0A4C00
1546-522 Traceback:
20112070 = _debug_strcat + 0x94
201115f4 = strcat + 0x20
100004a4 = main + 0xE0
Hello world! I'm 0 of 2 on Cu12
ERROR: 0031-250 task 1: IOT/Abort trap
ERROR: 0031-250 task 0: Terminated
The runtime error already provides a good clue with the strcat
information in the traceback. The pdbx debugger can further
illuminate the problem:
Cu12:~/c123% pdbx ef_mpi -procs 2
pdbx Version 3, Release 2 -- May 5 2004 14:06:57
reading symbolic information ...
reading symbolic information ...
[1] stopped in main at line 24 ($t1)
   24     MPI_Init(&argc, &argv);
[1] stopped in main at line 24 ($t1)
   24     MPI_Init(&argc, &argv);
0031-504 Partition loaded
...
pdbx(all) cont
AIX Cu12 1 5 0024BB0A4C00
AIX Cu12 1 5 0024BB0A4C00
Hello world! I'm 1 of 2 on Cu12
Hello world! I'm 0 of 2 on Cu12
1546-504 Internal storage object was overwritten at 0x2458E5AB.
1546-522 Traceback:
20112070 = _debug_strcat + 0x94
201115f4 = strcat + 0x20
100004a4 = main + 0xE0
IOT/Abort trap in pthread_kill at 0xd005cb14 ($t1)
0xd005cb14 (pthread_kill+0xa8) 80410014 lwz r2,0x14(r1)
^C
pdbx-subset(all) tasks
0:R 1:D
pdbx-subset(all) on 1
pdbx(1) where
pthread_kill(??, ??) at 0xd005cb14
_p_raise(??) at 0xd005c120
_tm_msg_print(??, ??, ??, ??, ??) at 0x20193380
_mem_error(??, ??, ??, ??, ??) at 0x2019ac98
_test_dbg_allocated(??) at 0x20193494
_int_uheap_verify(??, ??, ??, ??) at 0x20193b98
_chk_if_heap(??, ??, ??, ??) at 0x20198440
_debug_strcat(??, ??, ??, ??) at 0x20112070
@cbase.strcat(??, ??) at 0x201115f4
main(argc = 1, argv = 0x2ff224e4, ... = 0x2ff224ec, 0x28, 0x2ff22ff8, 0x0, 0x10348007, 0x5), line 41 in "ef_mpi.c"
pdbx(1) l
   25     MPI_Barrier(MPI_COMM_WORLD);
   26
   27
   28     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   29     MPI_Comm_size(MPI_COMM_WORLD, &size);
   30
   31     MPI_Get_processor_name(name, &len);
   32     MPI_Barrier(MPI_COMM_WORLD);
   33
   34     system("/bin/uname -a");
pdbx(1) l
   35
   36     printf ("Hello world! I'm %d of %d on %s\n", rank, size, name);
   37
   38     if (rank == 1)
   39     {
   40         badptr=(char *) malloc(1 * sizeof(char));
   41         strcat(badptr,"this is a test string .");
   42     }
   43
   44
pdbx(1) q
Cu12:~/c108%
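Line 41 of ef_mpi.c has two problems: the one-byte buffer is too small
for the text, and strcat() appends to memory that malloc() never
initialized as a string. One way to write the operation safely is
sketched below. The helper name append_text() is hypothetical, and the
sketch assumes the goal is a fresh buffer holding a prefix followed by
the new text.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated string containing
 * prefix followed by text, or NULL if the allocation fails. */
char *append_text(const char *prefix, const char *text)
{
    size_t need = strlen(prefix) + strlen(text) + 1;  /* +1 for NUL */
    char *buf = malloc(need);

    if (buf == NULL)
        return NULL;

    buf[0] = '\0';          /* make buf a valid empty string first */
    strcat(buf, prefix);
    strcat(buf, text);
    return buf;
}
```

In the example above, badptr = append_text("", "this is a test string .");
would take the place of the malloc() and strcat() pair, with a matching
free(badptr) before MPI_Finalize().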