IBM Books

Hitchhiker's Guide


Can't start a parallel job

Once you have successfully compiled your program, you either invoke it directly or start the Parallel Operating Environment (POE) and then submit the program to it. In both cases, POE is started to establish communication with the parallel nodes. Problems that can occur at this point include:

These problems can originate on the home node (where you're trying to submit the job), on the remote parallel nodes, or in the communication subsystem that connects them. You need to make sure that everything POE expects to be set up really is set up. Here's what you do:

  1. Make sure that you can execute POE. If you're a Korn shell user, type:
    $ whence poe
    

    If you're a C shell user, type:

    % which poe
    

    If the result is just the shell prompt, you don't have POE in your path. It might mean that POE isn't installed, or that your path doesn't point to it. Check that the file /usr/lpp/ppe.poe/bin/poe exists and is executable, and that your PATH includes the directory /usr/lpp/ppe.poe/bin.

  2. Type:
    $ env | grep MP_
    

    Look at the settings of the environment variables beginning with MP_ (the POE environment variables). Check their values against what you expect, particularly MP_HOSTFILE (where the list of remote host names is to be found), MP_RESD (whether a job management system is to be used to allocate remote hosts), and MP_RMPOOL (the pool from which the job management system is to allocate remote hosts). If they're all unset, make sure that you have a file named host.list in your current directory. This file must include the names of all the remote parallel hosts that can be used. There must be at least as many hosts available as the number of parallel processes you specified with the MP_PROCS environment variable.

  3. Type:
    $ poe -procs 1
    

    You should get the following message:

      
         0031-503   Enter program name and flags for each node: _
     
    

    If you get this message, POE has successfully loaded and established communication with the first remote host in your host list file. It has also validated your use of that remote host, and is ready to go to work. If you type any AIX command (for example, date, hostname, or env), you should get a response when the command executes on the remote host, just as you would from rsh.

    If you get some other set of messages, then the message text should give you some idea of where to look. Some common situations include:
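
The checks in steps 1 and 2 above can be gathered into a small script that you run before trying step 3. What follows is only a sketch, not something supplied with POE: the function names (dir_in_path, count_hosts, enough_hosts, preflight) are made up for illustration. It assumes that the host list file holds one host per non-blank line, that lines starting with ! or * are comments, and that an unset MP_PROCS means one process. It covers only the case where hosts come from a host list file, not from a job management system.

```shell
#!/bin/sh
# Pre-flight checks for steps 1 and 2. All function names are illustrative.

POE_DIR=/usr/lpp/ppe.poe/bin   # install directory named in step 1

# Return 0 if directory $1 appears as an entry in PATH.
dir_in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Print the number of host entries in host list file $1.
# Assumption: one host per non-blank line; lines starting
# with ! or * are treated as comments and skipped.
count_hosts() {
    grep -v -e '^[[:space:]]*$' -e '^[!*]' "$1" | wc -l | tr -d ' '
}

# Return 0 if host list $1 is readable and has at least $2 entries.
enough_hosts() {
    [ -r "$1" ] && [ "$(count_hosts "$1")" -ge "$2" ]
}

# Run every check and report each problem found.
preflight() {
    hostfile=${MP_HOSTFILE:-host.list}   # host.list is used when MP_HOSTFILE is unset
    procs=${MP_PROCS:-1}                 # assumption: unset MP_PROCS means 1

    [ -x "$POE_DIR/poe" ] || echo "poe is missing or not executable in $POE_DIR"
    dir_in_path "$POE_DIR" || echo "PATH does not include $POE_DIR"
    enough_hosts "$hostfile" "$procs" ||
        echo "$hostfile is unreadable or lists fewer than $procs hosts"
}

preflight
```

If the script prints nothing, the basic setup looks sane and you can go on to poe -procs 1. Each message it prints corresponds to one of the conditions described in steps 1 and 2.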

