Once POE can be started, you'll need to consider the problems that can arise in running a parallel program, specifically initializing the message passing subsystem. The way to eliminate this initialization as the source of POE startup problems is to run a program that does not use message passing.
As discussed in Running POE, you can use POE to invoke any AIX command or serial program on remote nodes. If an AIX command or a simple program like Hello, World! runs under POE but a parallel program doesn't, you can be pretty sure the problem is in the message passing subsystem. The message passing subsystem is the underlying implementation of the message passing calls used by a parallel program (for example, MPI_Send). POE code that's linked into your executable by the compiler script (mpcc, mpCC, mpxlf, mpcc_r, mpCC_r, mpxlf_r) initializes the message passing subsystem.
The Parallel Operating Environment (POE) supports two distinct communication subsystems: an IP-based system, and User Space optimized adapter support for the SP Switch and SP Switch2. The subsystem choice is normally made at run time, by environment variables or command line options passed to POE. Use the IP subsystem for diagnosing initialization problems before worrying about the User Space (US) subsystem. Select the IP subsystem by setting the environment variables:
$ export MP_EUILIB=ip
$ export MP_EUIDEVICE=en0
Use specific remote hosts in your host list file and don't use LoadLeveler (set MP_RESD=no).
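As a minimal sketch of that setup, the following writes a two-node host list file and turns off LoadLeveler allocation. The host names node1.example.com and node2.example.com are placeholders, and the sketch assumes that POE picks up a file named host.list in the current directory; adjust both to match your installation.

```shell
# Write a host list file naming specific remote hosts, one per line.
# The host names are hypothetical; substitute nodes you can reach.
# Assumption: POE reads host.list from the current directory.
cat > host.list <<'EOF'
node1.example.com
node2.example.com
EOF

# Bypass LoadLeveler and select the IP subsystem over Ethernet.
export MP_RESD=no
export MP_EUILIB=ip
export MP_EUIDEVICE=en0
```

Two entries are enough for the two-task run later in this section; the one-task run needs only one of them.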
If you don't have a small parallel program around, recompile hello.c as follows:
$ mpcc -o hello_p hello.c
and make sure that the executable is loadable on the remote host that you are using.
Type the following command, and then look at the messages on the console:
$ poe hello_p -procs 1 -infolevel 4
If the last message that you see looks like this:
Calling mpci_connect
and there are no further messages, there's an error in opening a UDP socket on the remote host. Check to make sure that the IP address of the remote host is correct, as reported in the informational messages printed out by POE. Also, perform any other IP diagnostic procedures of which you are aware.
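If you capture the console output in a file, a small helper can check for this symptom. The sketch below is only an illustration: the function name stalled_at_connect is made up for this example, and it assumes the log is a plain text file in which the message text appears somewhere on the last non-empty line.

```shell
# Report whether a captured POE console log stops at mpci_connect.
# Usage: stalled_at_connect LOGFILE
# Returns 0 if the last non-empty line contains the message, 1 otherwise.
stalled_at_connect() {
    # Drop blank lines, then keep only the final line of the log.
    last_line=$(grep -v '^[[:space:]]*$' "$1" | tail -n 1)
    case "$last_line" in
        *"Calling mpci_connect"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, after redirecting the output of the poe command above to a file of your choosing, `stalled_at_connect` on that file succeeds only when nothing was printed after the connect message.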
If you get
Hello, World!
then the communication subsystem has been successfully initialized on the one node and things ought to be looking good. Just for kicks, make sure that there are two remote nodes in your host list file and try again with the following:
$ poe hello_p -procs 2
If and when hello_p works with IP and device en0 (the Ethernet), try again with the SP Switch.
Each SP node has one name that it is known by on the external LAN to which it is connected and another name it is known by on the SP Switch. If the node name you use is not the proper name for the network device you specify, the connection will not be made. You can put the names in your host list file. Otherwise you will have to use LoadLeveler to locate the nodes.
For example,
$ export MP_RESD=yes
$ export MP_EUILIB=ip
$ export MP_EUIDEVICE=css0
$ poe hello_p -procs 2 -ilevel 2
where css0 is the switch device name.
Look at the console lines containing the string init_data. These identify the IP address that is actually being used for message passing (as opposed to the IP address that is used to connect the home node to the remote hosts). If these aren't the switch IP addresses, check the LoadLeveler configuration and the switch configuration.
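One way to pick those lines out of a long console log is to filter a captured copy of the output. This is a sketch under the assumption that you redirected the POE output to a text file; the function name show_init_data is hypothetical, and the exact layout of the init_data lines is not assumed.

```shell
# Print the init_data lines from a captured POE console log.
# Usage: show_init_data LOGFILE
# Returns 1 (with a note on stderr) if no such lines are present.
show_init_data() {
    if ! grep 'init_data' "$1"; then
        echo "no init_data lines found in $1" >&2
        return 1
    fi
}
```

Compare the addresses in the printed lines with the switch IP addresses of the nodes you expected to use.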
Once IP works, and you're on an SP machine, you can try message passing using the User Space device support. Note that LoadLeveler allows you to run multiple tasks over the switch adapter while in User Space.
You can run hello_p with the User Space library by typing:
$ export MP_RESD=yes
$ export MP_EUILIB=us
$ export MP_EUIDEVICE=css0
$ poe hello_p -procs 2 -ilevel 6
The console log should inform you that you're using User Space support, and that LoadLeveler is allocating the nodes for you. LoadLeveler tells you that it can't allocate the requested nodes if someone else is already running on them and has requested dedicated use of the switch, or if User Space capacity has been exceeded.
So, what do you do now? You can try for other specific nodes, or you can ask LoadLeveler for non-specific nodes from a pool, but by this time, you're probably far enough along that we can just refer you to IBM Parallel Environment for AIX: Operation and Use, Vol. 1.
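As a recap, the three configurations tried in this section differ only in the values of MP_RESD, MP_EUILIB, and MP_EUIDEVICE. The helper below is a convenience sketch for switching between them in one shell session; the function name poe_stage and the stage names ethernet, switch-ip, and switch-us are invented for this example and are not part of POE.

```shell
# Select one of the three configurations used in this section.
# Usage: poe_stage ethernet | switch-ip | switch-us
poe_stage() {
    case "$1" in
        ethernet)  MP_RESD=no;  MP_EUILIB=ip; MP_EUIDEVICE=en0  ;;  # IP over Ethernet, host list file
        switch-ip) MP_RESD=yes; MP_EUILIB=ip; MP_EUIDEVICE=css0 ;;  # IP over the switch, LoadLeveler
        switch-us) MP_RESD=yes; MP_EUILIB=us; MP_EUIDEVICE=css0 ;;  # User Space over the switch
        *) echo "usage: poe_stage ethernet|switch-ip|switch-us" >&2
           return 1 ;;
    esac
    export MP_RESD MP_EUILIB MP_EUIDEVICE
}
```

Calling `poe_stage switch-ip` before `poe hello_p -procs 2 -ilevel 2` reproduces the switch IP test; working through the stages in order keeps each new failure tied to the one setting that changed.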