
datalink, issue 9903

A Case Study in Porting and Optimization:
NCAR's MOZART

NCAR's MOZART (Model for OZone And Related chemical Tracers) global chemical transport model simulates changes in the chemical composition of the troposphere. It can also be used to assess the impact of human-induced perturbations such as aircraft operations, fossil fuel combustion, and biomass burning. This 3D global chemistry simulation is a highly vectorized, multitasked code that typically runs at 95% efficiency on Cray vector supercomputers. PECM's Mark Straka recently ported the code to the NCSA SGI Origin2000 system, then tested it and analyzed its parallel performance.

Straka's work is a good case study in how to approach a porting and optimization project. His approach had three steps:

  1. replacing Cray directives with equivalent SGI and OpenMP directives (parallelization)
  2. reviewing the I/O cycle for unique or unusual situations that could impact optimization (parallel or large file I/O)
  3. breaking the code down into manageable parts to isolate the bottlenecks (single processor optimization)

The MOZART code came with existing Cray autotasking directives that parallelize the main loop, which accounts for about 90% of execution time. Straka replaced these with the equivalent SGI native (C$DOACROSS) and OpenMP (C$OMP) directives. Five or six critical loops inside the main time-step loop were converted this way; together they account for practically all of the execution time apart from I/O. The performance of the two implementations proved to be identical within measurement accuracy.
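The directive replacement described above can be sketched as a small text transformation. This is only an illustration, not Straka's procedure: it assumes the Cray directives take the single-line form `CMIC$ DO ALL SHARED(...) PRIVATE(...)` (the article does not show them), and the function name `translate_directive` is hypothetical. Continuation lines, other clauses, and the 72-column limit of fixed-form Fortran are deliberately ignored.

```python
import re

# Assumed input form (not shown in the article), one directive per line:
#   CMIC$ DO ALL SHARED(A, B) PRIVATE(I, J)
_CRAY = re.compile(r"^CMIC\$\s+DO\s+ALL\b(.*)$", re.IGNORECASE)
_CLAUSE = re.compile(r"\b(SHARED|PRIVATE)\s*\(([^)]*)\)", re.IGNORECASE)


def translate_directive(line, target="openmp"):
    """Rewrite one Cray autotasking line; return other lines unchanged.

    `line` is expected without a trailing newline.
    target="openmp" emits C$OMP PARALLEL DO; target="sgi" emits C$DOACROSS.
    """
    match = _CRAY.match(line)
    if match is None:
        return line  # ordinary source line, leave it alone

    clauses = {name.upper(): args.strip()
               for name, args in _CLAUSE.findall(match.group(1))}
    shared = clauses.get("SHARED")
    private = clauses.get("PRIVATE")

    if target == "openmp":
        parts = ["C$OMP PARALLEL DO"]
        if shared:
            parts.append(f"SHARED({shared})")
        if private:
            parts.append(f"PRIVATE({private})")
        return " ".join(parts)

    if target == "sgi":
        # The SGI native directive names these clauses SHARE and LOCAL.
        parts = []
        if shared:
            parts.append(f"SHARE({shared})")
        if private:
            parts.append(f"LOCAL({private})")
        return ("C$DOACROSS " + ", ".join(parts)).rstrip()

    raise ValueError(f"unknown target: {target!r}")
```

With only five or six loops to convert, editing by hand is just as practical; the value of the sketch is in showing how closely the three directive families correspond for a simple loop.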

Long production MOZART runs require about 1 Gbyte of disk to be available at any given time during the run because the code cycles through many history files of roughly 0.5 Gbyte each. The code currently uses system and shell calls to access mass storage (MSS). Initially, the scheme was to fetch the next file in the series once I/O on the previous file had completed. The drawback of this approach is that if an MSS retrieve operation failed, the code would fail when it attempted I/O on a file that was not yet available. At best, execution would be seriously delayed. Since then, members of the MOZART team have improved the robustness of the I/O routines and execution scripts to make this scenario much less likely.
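The article does not say how the MOZART team hardened the I/O path. One common approach, sketched here purely as an assumption, is to confirm that each history file has been staged, retrying the retrieval a few times, before any I/O is attempted on it. The names `ensure_staged` and `process_series` are hypothetical, and `fetch` stands in for whatever system or shell call retrieves a file from MSS.

```python
import time


def ensure_staged(fetch, name, attempts=3, delay_s=0.0):
    """Call fetch(name) until it reports success or attempts run out.

    `fetch` is a hypothetical callable returning True once the file is
    on local disk. Returns the number of attempts that were needed.
    """
    for attempt in range(1, attempts + 1):
        if fetch(name):
            return attempt
        if attempt < attempts:
            time.sleep(delay_s)  # back off before retrying the retrieval
    raise RuntimeError(f"could not stage {name!r} after {attempts} attempts")


def process_series(names, fetch, process, attempts=3, delay_s=0.0):
    """Process history files in order, never touching an unstaged file."""
    done = []
    for name in names:
        ensure_staged(fetch, name, attempts, delay_s)
        process(name)  # I/O happens only after staging has succeeded
        done.append(name)
    return done
```

A failed retrieval then surfaces as one explicit error naming the missing file, rather than as an I/O failure somewhere inside the model.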

The code as it arrived for the SGI Origin2000 was not being compiled at the highest optimization level. The -O3 compiler option was not being used because earlier debugging had shown that a few places in the code failed when compiled with it. Using the fsplit utility, Straka broke the code down from a monolithic file of hundreds of subroutines and thousands of lines into its component F90 routines, then systematically compiled them at increasing levels of optimization. This let him isolate the offending routines, of which there were fewer than six. The resulting code is approximately 20% faster than the original.
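The isolation step can be sketched as a loop over the split source files: raise one routine at a time to the higher optimization level, keep the rest at a level assumed to be safe, and record which trials fail. This is a minimal sketch, not Straka's actual script. The `-O2` baseline, the file names, and the `passes` callable (which would build the code with the given per-file flags and run a short test) are all assumptions.

```python
def find_offenders(routines, passes, high="-O3", safe="-O2"):
    """Return the routines that break the build or test when compiled at `high`.

    `passes(flags)` is a hypothetical callable: given a dict mapping each
    source file to its optimization flag, it builds, runs a short test,
    and returns True on success.
    """
    offenders = []
    for candidate in routines:
        flags = {name: safe for name in routines}
        flags[candidate] = high  # raise exactly one routine per trial
        if not passes(flags):
            offenders.append(candidate)
    return offenders


def final_flags(routines, offenders, high="-O3", safe="-O2"):
    """Compile everything at `high` except the known offenders."""
    bad = set(offenders)
    return {name: (safe if name in bad else high) for name in routines}
```

Testing one routine at a time will not catch failures that appear only when several routines are optimized together, so the combined result of `final_flags` should still be checked with `passes` before it is trusted.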

Because of the large number of repetitions needed and the costly I/O overhead involved, Straka could feasibly test only on 5-minute, 5-step runs. He passed his code to the MOZART team to verify numerical accuracy and measure the ultimate speedup on long production runs. The specific cause of the failure of these routines at high optimization has not been determined. Straka narrowed the problems down to a complicated code transformation issue but did not pursue them further, because the offending routines are not time critical in the profile of the code. The PROD_LOSS subroutine takes the lion's share of the execution time. Studying this routine revealed no feasible code transformations, as it consists almost entirely of expensive data motion operations. Experimental code transformations in other key routines produced no performance improvements.