Performance and Parallelization Study
of a Configurational Bias Monte Carlo Code
Researchers have traditionally used classical Molecular Dynamics (MD) or Monte Carlo (MC)
simulations to study the thermodynamic and statistical properties of macromolecular systems
such as proteins, membrane lipid bilayers, and polymers. Each method has its
advantages and disadvantages. One disadvantage of MD
is that the system under investigation converges slowly towards equilibrium.
Recently, researchers have developed hybrid methods to improve the rate of convergence
of MD simulations. One such hybrid method combines MD with Configurational Bias Monte Carlo (CBMC).
In MD-CBMC, a CBMC cycle is performed after every several MD steps to drive the system towards
thermodynamic equilibrium.
Developing a Plan
Professor H. Larry Scott (Oklahoma State University) wanted to port his CBMC code to NCSA's SGI Origin2000 and parallelize it. The original serial code was written in a mix of F77 and F90: it was mostly F77 but used modules and a few other F90 features such as the CASE construct. The code ran on a DEC Alpha system.
The project plan was to:
1. Compile, test, and benchmark the serial code as is
2. Turn on -O3 optimization and then test and benchmark the serial code
3. Replace F77 array constructs with F90 constructs and repeat step 2
4. Parallelize the code using OpenMP
The original code compiled cleanly and ran correctly on the Origin2000, and the test job took 254 seconds. Turning on -O3 optimization reduced the run time to 129 seconds. However, -O3 introduced round-off errors that concerned the code's author. Adding -OPT:roundoff=0 to the compiler command line eliminated these errors without any loss in performance.
More Advanced Changes
Because the CBMC cycle is repeated tens of thousands of times, any improvement per CBMC step saves considerable CPU time over the length of an entire simulation. It was therefore necessary to identify the compute-intensive code blocks and try to improve their performance. Replacing the F77 array constructs with F90 equivalents condensed the code and made it easier to understand. In addition, F90 array constructs allow the compiler to optimize the code more effectively.
Once the F90 array constructs were in place, it was easier to evaluate the code's performance bottlenecks. To measure performance first at the subroutine level and then at the level of individual lines, we used SGI's SpeedShop package. After the code was compiled in the usual way, the following command was used to run the test case:
ssrun -v -pcsamp rundopc
where "rundopc" is the executable. This produces a "pcsamp" output file that can be examined using the command:
prof
With the help of SpeedShop, the run time of the serial code was reduced to 58 seconds, more than a factor of 4 faster than the original code.
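The timings reported so far can be turned into speedup factors with a few lines of arithmetic. The sketch below uses only the three run times quoted in the text (254, 129, and 58 seconds); the stage labels and variable names are ours, chosen for illustration.

```python
# Run times in seconds for the test job, as quoted in the text.
# The stage labels are descriptive names chosen for this sketch.
timings = [
    ("original serial code", 254.0),
    ("with -O3", 129.0),
    ("after F90 array constructs and profiling-guided tuning", 58.0),
]

baseline = timings[0][1]
speedups = {}
previous = baseline
for stage, seconds in timings:
    speedups[stage] = {
        "vs_original": baseline / seconds,        # cumulative gain
        "vs_previous_stage": previous / seconds,  # gain from this stage alone
    }
    previous = seconds

for stage, s in speedups.items():
    print(f"{stage}: {s['vs_original']:.2f}x overall, "
          f"{s['vs_previous_stage']:.2f}x over the previous stage")
```

Separating the cumulative factor from the per-stage factor shows that the compiler flag and the later source-level work each contributed roughly a factor of two.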
Addressing Parallelization
Unfortunately, a CBMC run cannot be parallelized over its steps (typically 5000 to 10000 of them), because each step depends on the configuration generated by the previous one; in effect, a single random walker takes all 5000 or 10000 steps. Only the computations within each step can be parallelized, which means that we can expect useful parallelism over only 4 to 8 processors.
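This dependency structure can be illustrated with a toy model. The sketch below is not the CBMC code: it is a one-dimensional walker with a hypothetical harmonic energy function (`energy`), a hypothetical number of trial moves per step (`n_trials`), and a Python thread pool standing in for OpenMP threads. It shows only that the outer loop over steps must run in order, while the trial evaluations inside one step are independent of each other.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def energy(x):
    """Hypothetical energy function (harmonic well); stands in for a real force field."""
    return 0.5 * x * x


def run_walker(n_steps, n_trials, seed, pool=None):
    """Single walker: each step starts from the configuration left by the previous one."""
    rng = random.Random(seed)
    x = 0.0
    trajectory = []
    for _ in range(n_steps):  # sequential: step i needs the result of step i-1
        # Propose several trial positions around the current configuration.
        trials = [x + rng.uniform(-1.0, 1.0) for _ in range(n_trials)]
        # The trial energies do not depend on one another, so they can be
        # evaluated concurrently. This is the only parallelism available.
        if pool is None:
            energies = [energy(t) for t in trials]
        else:
            energies = list(pool.map(energy, trials))  # map preserves input order
        # Pick one trial with probability proportional to its Boltzmann weight.
        weights = [math.exp(-e) for e in energies]
        x = rng.choices(trials, weights=weights, k=1)[0]
        trajectory.append(x)
    return trajectory


serial = run_walker(n_steps=200, n_trials=8, seed=1)
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = run_walker(n_steps=200, n_trials=8, seed=1, pool=pool)

print("same trajectory:", serial == parallel)
```

With only `n_trials` independent tasks per step, adding workers beyond that number cannot help, which mirrors the small processor counts mentioned above. In standard CPython the thread pool does not actually speed up this pure-Python arithmetic; it is used here only to mark where the independent work sits.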
We adopted OpenMP as the method of parallelization. Because our project plan called for replacing
F77 constructs with F90 ones, we had replaced the DO loops with F90 array statements, which OpenMP
currently cannot parallelize. To parallelize the code, we had to reintroduce the DO loops with the largest
range of indices and place OpenMP PARALLEL DO directives around them.
The compiled code was tested to make sure that the parallelization did not introduce bugs. As
expected, the code's performance degraded beyond 4 processors; on 4 processors the
test case took about 24 seconds (about twice as fast as on one processor). To improve the parallel
performance we tried compiler options such as -chunk=8 and -mp_schedtype=DYNAMIC, which improved
performance only slightly. Even so, every small improvement yields a significant saving in
CPU time over the length of an entire simulation.
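To see why the schedule type and chunk size can matter, the sketch below simulates how loop iterations are handed to threads under a static schedule (chunks dealt out round-robin before the loop runs) and a dynamic schedule (the next free thread takes the next chunk), which is how OpenMP defines those two schedule kinds. The per-iteration costs are invented for illustration and are not measurements from the CBMC code; the function names are ours, and scheduling overhead is ignored.

```python
import heapq


def chunks(n_iters, chunk):
    """Split iteration indices 0..n_iters-1 into consecutive chunks."""
    return [range(start, min(start + chunk, n_iters))
            for start in range(0, n_iters, chunk)]


def static_makespan(costs, n_threads, chunk):
    """Chunks are assigned to threads round-robin, decided before the loop runs."""
    busy = [0.0] * n_threads
    for i, block in enumerate(chunks(len(costs), chunk)):
        busy[i % n_threads] += sum(costs[j] for j in block)
    return max(busy)


def dynamic_makespan(costs, n_threads, chunk):
    """Each chunk goes to whichever thread becomes free first."""
    free_at = [0.0] * n_threads  # min-heap of the times at which threads become free
    heapq.heapify(free_at)
    for block in chunks(len(costs), chunk):
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + sum(costs[j] for j in block))
    return max(free_at)


# Hypothetical, deliberately uneven workload: early iterations cost more than late ones.
costs = [float(64 - i) for i in range(64)]
ideal = sum(costs) / 4  # perfect balance on 4 threads

for chunk in (16, 8, 1):
    s = static_makespan(costs, n_threads=4, chunk=chunk)
    d = dynamic_makespan(costs, n_threads=4, chunk=chunk)
    print(f"chunk={chunk:2d}  static={s:6.1f}  dynamic={d:6.1f}  ideal={ideal:6.1f}")
```

For this made-up workload, smaller chunks and dynamic assignment bring the slowest thread closer to the ideal finish time. On a real machine each dynamic hand-off has a cost that this model leaves out, which is one reason the gain observed in practice can be small.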
Impact of the Plan
These performance improvements allow the Oklahoma State researchers to carry out
long MD-CBMC simulations of biological lipid bilayer membranes. In addition,
the parallel performance of the code is expected to improve as the size of the
membrane system increases, because a larger system puts more computational work in the parallel
regions of the code.
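The expectation that larger systems will scale better can be made concrete with Amdahl's law, S(n) = 1 / ((1 - p) + p/n), where p is the fraction of run time spent in parallel regions and n is the processor count. The sketch below is a rough estimate under strong assumptions: it treats the 58-second serial time as the one-processor baseline for the 24-second, 4-processor run, and it ignores parallel overhead. The larger values of p in the loop are hypothetical, chosen only to show the trend.

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n processors when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


def parallel_fraction(speedup, n):
    """Invert Amdahl's law to get p from an observed speedup on n processors."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


# Timings quoted in the text; using 58 s as the 1-processor baseline is an assumption.
t_serial, t_parallel, n_procs = 58.0, 24.0, 4
observed = t_serial / t_parallel
p_est = parallel_fraction(observed, n_procs)
print(f"observed speedup {observed:.2f}, implied parallel fraction {p_est:.2f}")

# Hypothetical larger parallel fractions, as might arise for a bigger membrane system.
for p in (p_est, 0.90, 0.95):
    print(f"p={p:.2f}: predicted speedup on 4 processors = {amdahl_speedup(p, 4):.2f}")
```

Under these assumptions, a little over three quarters of the test job's run time sits in parallel regions, and raising that fraction is what would push the 4-processor speedup up from its current value.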
--Balaji Veeraraghavan
03/99