Once you have successfully compiled your program, you either invoke it directly or start the Parallel Operating Environment (POE) and then submit the program to it. In both cases, POE is started to establish communication with the parallel nodes. Problems that can occur at this point include:
or
These problems can be caused by other problems on the home node (where you're trying to submit the job), on the remote parallel nodes, or in the communication subsystem that connects them. You need to make sure that all the things POE expects to be set up really are set up. Here's what you do:
Make sure that POE is in your path. If you're a Korn shell user, type:
$ whence poe
If you're a C shell user, type:
$ which poe
If the result is just the shell prompt, you don't have POE in your path. It might mean that POE isn't installed, or that your path doesn't point to it. Check that the file /usr/lpp/ppe.poe/bin/poe exists and is executable, and that your PATH includes the directory /usr/lpp/ppe.poe/bin.
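The two checks above (the file exists and is executable, and its directory is in your PATH) can be sketched as a small shell function. This is only an illustration: the name check_poe is made up for this example, and the default location is the one given in the text.

```shell
#!/bin/sh
# check_poe is a hypothetical helper, not part of POE.
# It checks that the poe binary exists, is executable, and that its
# directory appears in PATH. The default path comes from the text above.
check_poe() {
    poe_bin=${1:-/usr/lpp/ppe.poe/bin/poe}
    poe_dir=${poe_bin%/*}          # strip the file name, keep the directory

    if [ ! -f "$poe_bin" ] || [ ! -x "$poe_bin" ]; then
        echo "missing or not executable: $poe_bin"
        return 1
    fi

    # Wrap PATH in colons so the first and last entries match like any other.
    case ":$PATH:" in
        *":$poe_dir:"*)
            echo "ok: $poe_dir is in PATH"
            ;;
        *)
            echo "not in PATH: $poe_dir"
            return 2
            ;;
    esac
}

# Usage: check_poe                 (checks the default location)
#        check_poe /some/other/poe (checks a location you supply)
```

A return code of 1 suggests an installation problem; a return code of 2 suggests that only your PATH needs fixing.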
$ env | grep MP_
Look at the settings of the environment variables beginning with MP_ (the POE environment variables). Check their values against what you expect, particularly MP_HOSTFILE (where the list of remote host names is to be found), MP_RESD (whether a job management system is to be used to allocate remote hosts), and MP_RMPOOL (the pool from which the job management system is to allocate remote hosts). If they're all unset, make sure that you have a file named host.list in your current directory. This file must include the names of all the remote parallel hosts that can be used. There must be at least as many hosts available as the number of parallel processes you specified with the MP_PROCS environment variable.
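If you want to compare the host list against MP_PROCS without counting by hand, something like the following sketch may help. The function names are invented for this example. It assumes one host name per line, that blank lines and lines starting with ! or * are not hosts, and that an unset MP_PROCS means one process; check those assumptions against your own host list file.

```shell
#!/bin/sh
# count_hosts and enough_hosts are hypothetical helpers, not part of POE.

# Print the number of host entries in a host list file.
# Assumption: one host per line; blank lines and lines whose first
# character is ! or * are treated as comments and skipped.
count_hosts() {
    hostfile=${1:-${MP_HOSTFILE:-host.list}}
    if [ ! -r "$hostfile" ]; then
        echo "cannot read $hostfile" >&2
        return 1
    fi
    awk 'NF && $1 !~ /^[!*]/ { n++ } END { print n + 0 }' "$hostfile"
}

# Compare the host count with MP_PROCS (assumed to be 1 when unset).
enough_hosts() {
    needed=${MP_PROCS:-1}
    have=$(count_hosts "$1") || return 1
    if [ "$have" -ge "$needed" ]; then
        echo "ok: $have hosts for $needed processes"
    else
        echo "too few hosts: $have listed, $needed needed"
        return 2
    fi
}

# Usage: enough_hosts              (uses MP_HOSTFILE, or host.list)
#        enough_hosts ./my.hosts   (uses the file you name)
```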
$ poe -procs 1
You should get the following message:
0031-503 Enter program name and flags for each node: _
If you do get this message, POE has successfully loaded and established communication with the first remote host in your host list file. It has also validated your use of that remote host, and is ready to go to work. If you type any AIX command, for example, date, hostname, or env, you should get a response when the command executes on the remote host (just as you would from rsh).
If you get some other set of messages, then the message text should give you some idea of where to look. Some common situations include:
The path to the remote host is unavailable. Check to make sure that you are trying to connect to the host you think you are. If you are using LoadLeveler to allocate nodes from a pool, you may want to allocate nodes from a known list instead. Use ping on the remote hosts in the list to see if a path can be established to them. If it can, run rsh remote_host date to verify that the remote host can be contacted and recognizes the host from which you submitted the job, so it can send results back to you.
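With more than a handful of hosts, the ping and rsh checks get tedious, so here is a rough sketch of a loop over a host list file. The function names are made up for this example, and the same host list assumptions apply as before (one host per line, lines starting with ! or * skipped). It uses only ping -c 1 and rsh host date, as described above.

```shell
#!/bin/sh
# probe_host and probe_hostfile are hypothetical helpers, not part of POE.

# Check one host: first the network path (ping), then rsh access.
probe_host() {
    host=$1
    if ! ping -c 1 "$host" >/dev/null 2>&1; then
        echo "$host: no path (ping failed)"
        return 1
    fi
    if ! rsh "$host" date >/dev/null 2>&1; then
        echo "$host: reachable, but rsh failed"
        return 2
    fi
    echo "$host: ok"
}

# Check every host named in a host list file.
# Returns nonzero if any host failed either check.
probe_hostfile() {
    failures=0
    while read -r host _rest; do
        case $host in
            ''|'!'*|'*'*) continue ;;   # skip blank and comment lines
        esac
        # Redirect stdin so rsh cannot swallow the rest of the host list.
        probe_host "$host" </dev/null || failures=$((failures + 1))
    done < "${1:-host.list}"
    [ "$failures" -eq 0 ]
}

# Usage: probe_hostfile            (reads host.list)
#        probe_hostfile ./my.hosts (reads the file you name)
```

A host that fails ping points to a network or naming problem; a host that answers ping but fails rsh points to an access problem on the remote side.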
Check the /etc/services file on your home node, to make sure that the IBM Parallel Environment for AIX service is defined. Check the /etc/services and /etc/inetd.conf files on the remote host to make sure that the PE service is defined, and that the Partition Manager Daemon (pmd) program invoked by inetd on the remote node is executable.
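As one way to check the pmd side of this on a remote host, the sketch below looks for an inetd.conf entry whose server program ends in pmd and tests whether that file is executable. The function name is invented for this example, and it assumes the usual inetd.conf layout in which the sixth field is the server program path. It does not check /etc/services, because the service name depends on your installation.

```shell
#!/bin/sh
# check_pmd is a hypothetical helper, not part of PE.
# Assumption: inetd.conf fields are
#   service  socket-type  protocol  wait/nowait  user  server-path  args...
check_pmd() {
    conf=${1:-/etc/inetd.conf}

    # Take the server path from the first uncommented entry that runs pmd.
    pmd_path=$(awk '!/^#/ && $6 ~ /pmd$/ { print $6; exit }' "$conf" 2>/dev/null)

    if [ -z "$pmd_path" ]; then
        echo "no pmd entry found in $conf"
        return 1
    fi
    if [ -x "$pmd_path" ]; then
        echo "ok: $pmd_path is executable"
    else
        echo "not executable: $pmd_path"
        return 2
    fi
}

# Usage (on the remote host): check_pmd
```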
You need an ID on the remote host, and your ID on the home host (the one from which you are submitting the job) must be authorized to run commands on the remote hosts. You do this by placing a $HOME/.rhosts file on the remote hosts that identifies your home host and ID. Brush up on Access if you need to. Even if you have a $HOME/.rhosts file, make sure that you are not denied access by the /etc/hosts.equiv file on the remote hosts.
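To see whether a .rhosts file already names your home host and ID, you could run something like this sketch on the remote host. The function name and the host and user names in the usage line are made up. It assumes the common .rhosts layout of a host name optionally followed by a user name, treats a host-only line as admitting the same user name from that host, and ignores wildcard entries and /etc/hosts.equiv entirely. The host name has to be spelled the way the remote host resolves it, so a short name will not match a fully qualified entry.

```shell
#!/bin/sh
# rhosts_allows is a hypothetical helper, not part of AIX or POE.
# Succeeds if the file has a line naming the given home host and user.
# Assumption: each line is "hostname [username]".
rhosts_allows() {
    home_host=$1
    user=$2
    file=${3:-$HOME/.rhosts}

    [ -r "$file" ] || return 1

    awk -v h="$home_host" -v u="$user" '
        $1 == h && (NF == 1 || $2 == u) { found = 1 }
        END { exit !found }
    ' "$file"
}

# Usage (hypothetical names):
#   rhosts_allows home1.example.com alice && echo "entry present"
```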
In some installations, your home directory is a mounted file system on both your home node and the remote host. On the SP, this mounted file system is managed by AMD, the AutoMount Daemon. Occasionally, during user verification, the AutoMount Daemon doesn't mount your home directory fast enough, and pmd doesn't find your .rhosts file. Check with your System Administrator... as long as you know that he doesn't bite.
Even if the remote host is actually the same machine as your home node, you still need an entry in the .rhosts file. Sorry, that's the way AIX authentication works.
On the home node, you can set or increase the MP_INFOLEVEL environment variable (or use the -infolevel command line option) to get more information out of POE while it is running. Although this won't give you any more information about the error, or prevent it, it will give you an idea of where POE was, and what it was trying to do, when the error occurred. A value of 6 will give you more information than you could ever want. See Appendix A, "A sample program to illustrate messages", for an example of the output from this setting.
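For example, a minimal sketch of raising the information level before a one-process test run might look like this. The choice of hostname as the test command is just for illustration, and the run is skipped if poe isn't found in your path.

```shell
#!/bin/sh
# Raise POE's information level to the most verbose setting named above.
MP_INFOLEVEL=6
export MP_INFOLEVEL

# Run a harmless one-process job so the extra messages show what POE
# is doing. The check keeps the script quiet where poe is not installed.
if type poe >/dev/null 2>&1; then
    poe hostname -procs 1
    # Equivalent, using the command line option instead of the variable:
    #   poe hostname -procs 1 -infolevel 6
fi
```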