IBM Books

Hitchhiker's Guide


Checkpointing and restarting a parallel program

Checkpointing a parallel program is a mechanism for temporarily saving the state of a parallel program at a specific point (checkpointing), and then later restarting it from the saved state. When you checkpoint a program, the checkpointing function captures the state of the application as well as all data, and saves it in a file. When the program is restarted, the restart function retrieves the application information from the file it saved. The program then starts running again from the place at which it was saved.

Limitations

When checkpointing a program, there are a few limitations of which you should be aware. You can find a complete list of the limitations in the MPI Programming Guide. For example, you can only checkpoint a POE job that is running an MPI application that is compiled with the threaded libraries (such as programs compiled with mpcc_r, mpCC_r, mpxlf_r, xpxlf90_r, or mpxlf95_r). LAPI programs can also be checkpointed if they meet the limitations.

How checkpointing and restarting works

A checkpoint can occur if you use the poeckpt command or when an application makes a call to the mpc_init_ckpt() function. The former is referred to as a system-initiated checkpoint, while the latter is referred to as user-initiated checkpoint. A system-initiated checkpoint of a job being run under LoadLeveler occurs when the llckpt command is issued.

For a system-initiated checkpoint, the applications are checkpointed at the point in their processing they happen to be when the checkpoint is issued. Checkpoint files are written for each task of the parallel application and for the POE executable itself. The names and locations of these files are controlled by the setting of the MP_CKPTFILE and MP_CKPTDIR environment variables.

For a user-initiated checkpoint, the application may specify whether all tasks must issue the checkpoint request before the checkpoint occurs, or that one task of the application may cause the checkpoint of all tasks (and POE) to occur. The former is called a complete user-initiated checkpoint, and the latter is called a partial user-initiated checkpoint. In a complete user-initiated checkpoint, each task executes the application up to the point of the mpc_init_ckpt() function call. In a partial user-initiated checkpoint, only one task executes the application up to the point of the mpc_init_ckpt() call, and the remaining tasks are checkpointed at whatever point in their processing they happen to be when the checkpoint occurs, as in a system-initiated checkpoint.

After a checkpoint of an interactive POE job has been taken, the poerestart command is used to restart the parallel application. POE is restarted first and it uses the saved information from its checkpoint file to identify the task checkpoint files to also restart. You can restart the application on the same set or different set of nodes, but the number of tasks and the task geometry must remain the same. When the restart function restarts a program, it retrieves the program state and data information from the checkpoint file. Note also that the restart function restores file pointers to the points at which the checkpoint occurred, but it does not restore the file content.

Since large data files are often produced as a result of checkpointing a program, you need to consider the amount of available space in your file system. You should also consider the type of file system. Writing and reading checkpointing files may yield better performance on Journaled File Systems (JFS) or General Parallel File Systems (GPFS) than on Networked File Systems (NFS), Distributed File Systems (DFS), or Andrew File Systems (AFS).

For more information on checkpointing limitations, see IBM Parallel Environment for AIX: MPI Programming Guide or IBM LoadLeveler for AIX: Using and Administering.

A Checkpoint/Restart Scenario

A user's parallel application has been running on two nodes for six hours when she is informed that the nodes must be taken down for service in an hour. She expects her application to run for three more hours, and does not want to have to restart the application from the beginning on different nodes. Luckily, the user had set the CHECKPOINT environment variable to yes before issuing the POE command, so that AIX would allow the checkpoint to occur. Furthermore, she had set the MP_CKPTDIR environment variable to a GPFS directory, /gpfs, so that the checkpoint files would be accessible from other nodes. She also sets the MP_CKPTFILE environment variable to the name of her application, 9hourjob, so she can easily identify it later.

After setting the MP_CKPTDIR and MP_CKPTFILE environment variables, she obtains the process identifier of the POE process. Then, she issues the poeckpt command, along with the -k option so that the tasks will be terminated once the checkpoints are successfully completed. The checkpoints of the parallel tasks are taken first, and then the checkpoint of POE occurs. The poeckpt command reports the following:

poeckpt: Checkpoint of POE process 12345 has succeeded. 
poeckpt: The /gpfs/9hourjob.0 checkpoint file has been created.

The filename indicated in the output, /gpfs/9hourjob, is the checkpoint file of the POE process which will be used later when the parallel application is restarted. The ".0" suffix is a tag used to allow one set of previously successful checkpoint files to be saved (a subsequent checkpoint on this program, although unlikely in this scenario, would use tag 1).

Being curious about the behavior of the checkpoint function, the user issues:

ls /gpfs/9hour*

and sees the following output:

/gpfs/9hourjob.0    /gpfs/9hourjob.0.0   /gpfs/9hourjob.1.0

The additional files besides the one reported by the output are the checkpoint files from each of the tasks that made up the parallel application. Note that the last '0' in the task checkpoint files represents the checkpoint tag as described previously. The digit before the tag is the task number within the parallel application.

The user finds two other nodes that she can use to restart her parallel job and sets up a host.list, containing these two hostnames, in the directory from which she will run the poerestart command. She issues:

poerestart /gpfs/9hourjob.0

The restarted POE from this checkpoint file "remembers" the names of the task checkpoint files to restart from, tells the Partition Manager Daemon on each node to restart each parallel task from their respective checkpoint file, and the parallel application is up and running again. The job completes in three hours, and produces the same results as it would have had it run for nine hours on the original nodes.


[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]