
SAP Project:

Grid-Enabling Biomolecular Simulation with NAMD

Klaus Schulten
Beckman Institute for Advanced Science and Technology
University of Illinois at Urbana-Champaign

Research Objectives
SCIENTIFIC GOALS
Proteins are the principal actors in the functioning of biological cells. The mechanism of any particular protein is enabled by its native three-dimensional structure, which is determined by the sequence of amino acids that form the protein chain. The structures of over 20,000 proteins are known today, and it has been noted that proteins of similar function often have remarkably similar structures despite divergent sequences. The overriding goal of this project is to understand the relationship of the structure and dynamics of a protein to its function.

Specifically, we wish to understand the flexibility, compressibility, and average electrostatics of entire functional classes of proteins for which many structures are available. The classes chosen are oxygen-carrying proteins (myoglobins and hemoglobins), proteins with oxygen substrates (oxygenases), and similar proteins with heme groups (cytochromes-c). This project has been proposed with the Pittsburgh Supercomputing Center to the NSF as a joint US-UK technology demonstration for SC2005.

COMPUTATIONAL GOALS AND METHODS
The method of classical molecular dynamics simulation is of demonstrated value for modeling the large and small biomolecular motions responsible for the properties of interest outlined above; the method breaks down only when molecular bonds must be made or broken, or when the size or duration of the required simulation exceeds the available resources. Just as our program NAMD has demonstrated its ability to complete single large simulations in record time on a single parallel cluster or machine, we seek now to make a major stride in employing grid technology to rapidly execute and analyze modest simulations of entire protein classes. An additional goal is to drive the availability of on-demand computational resources that can grant immediate short-term access to up to, e.g., 128 processors for simulation setup, testing, and interactive steering.

POTENTIAL BENEFITS
The proposed SC2005 simulations will compare functionally related proteins from all domains of life (prokaryotes, eukaryotes, and archaea) and promise fundamental insight into the evolution of cellular processes through the integration of dynamics into comparative protein biology. The particular process chosen is medically relevant in the conduction of gases such as hydrogen, oxygen, and nitric oxide. The integration of grid technology and on-demand resources into the NAMD ecosystem will increase the usability and impact of molecular dynamics simulation on basic biomedical research.

COMPUTATIONAL APPROACH
The initial strategy suggested is to use Condor-G to launch and monitor the large numbers of simulations and analyses required for dynamics surveys of the type planned for SC2005. Scheduling large numbers of parallel NAMD simulations may require other approaches. In addition, we see the SGI Altix as an ideal on-demand platform, since its large memory and single system image would facilitate the temporary suspension of long-running jobs when urgent on-demand computation is required.

ACCOMPLISHMENTS AND SIGNIFICANCE
Professor Schulten leads the NIH Resource for Macromolecular Modeling and Bioinformatics, providing the programs VMD, NAMD, and BioCoRE to the biomedical research community free of charge. NAMD was awarded a Gordon Bell Award at SC2002 for unprecedented scaling on a challenging problem, while VMD has become the visualization software of choice even among the developers of competing molecular dynamics programs. Resource personnel engage in collaborations with many leading experimental labs, producing a steady stream of discoveries. See additional highlights.

PUBLICATIONS
Jordi Cohen, Kwiseon Kim, Matthew Posewitz, Maria L. Ghirardi, Klaus Schulten, Michael Seibert, and Paul King. Molecular dynamics and experimental investigation of H2 and O2 diffusion in [Fe]-hydrogenase. Biochemical Society Transactions, 33:80-82, 2005.

James C. Phillips, Gengbin Zheng, Sameer Kumar, and Laxmikant V. Kalé. NAMD: Biomolecular Simulation on Thousands of Processors. Proceedings of the IEEE/ACM SC2002 Conference. IEEE Press, 2002. Technical Paper 277.

John Stone, Justin Gullingsrud, Paul Grayson, and Klaus Schulten. A system for interactive molecular dynamics simulation. In John F. Hughes and Carlo H. Séquin, editors, 2001 ACM Symposium on Interactive 3D Graphics, pages 191-194, New York, 2001. ACM SIGGRAPH.

Laxmikant Kalé, Robert Skeel, Milind Bhandarkar, Robert Brunner, Attila Gursoy, Neal Krawetz, James Phillips, Aritomo Shinozaki, Krishnan Varadarajan, and Klaus Schulten. NAMD2: Greater scalability for parallel molecular dynamics. Journal of Computational Physics, 151:283-312, 1999.

Milind Bhandarkar, Gila Budescu, William F. Humphrey, Jesus A. Izaguirre, Sergei Izrailev, Laxmikant V. Kalé, Dorina Kosztin, Ferenc Molnar, James C. Phillips, and Klaus Schulten. BioCoRE: A collaboratory for structural biology. In Agostino G. Bruzzone, Adelinde Uhrmacher, and Ernest H. Page, editors, Proceedings of the SCS International Conference on Web-Based Modeling and Simulation, pages 242-251, San Francisco, California, 1999.

William Humphrey, Andrew Dalke, and Klaus Schulten. VMD - Visual Molecular Dynamics. Journal of Molecular Graphics, 14:33-38, 1996.

Many other papers are associated with this project.

 

Status Report
April 6, 2006

In 2005, the Theoretical and Computational Biophysics Group (TCBG) at the Beckman Institute initiated an effort to expand the capabilities of its suite of software for molecular modeling and visualization by investigating the inclusion of the latest generation of technologies for transparent, seamless access to distributed computational resources collectively referred to as a "Computational Grid".  This new project, called NAMD-G, has resulted in an initial implementation of a solution for management of NAMD-based molecular dynamics simulations over an instance of a leading-edge computational grid, specifically the National Science Foundation (NSF) TeraGrid facility.

Researchers within TCBG and their colleagues have long been substantial users of the HPC systems and services provided by the NSF supercomputer centers program, with ongoing large peer-reviewed allocations of time on the latest computational platforms made available through centers such as the National Center for Supercomputing Applications (NCSA), the San Diego Supercomputer Center (SDSC), and the Pittsburgh Supercomputing Center (PSC).

TCBG-developed software such as the molecular dynamics code NAMD has enjoyed increasing usage not only by TCBG but also by a growing international community that can benefit from the excellent parallel scalability that NAMD has been shown to achieve. As an example of the aggregate computational requirement for NAMD-based simulations, during a six-month period spanning from April through September 2005, NAMD was the Number 1-ranked application at NCSA based on service units delivered on NCSA production systems.  During the same period, NAMD-driven simulations accounted for 70% of the total usage of NCSA's most recent platform "Cobalt", an SGI Altix 3000 system configured with 1,024 Itanium 2 processors.

Even with the availability of state-of-the-art application software packages such as NAMD that are explicitly designed to efficiently use hundreds of processors to conduct molecular dynamics simulations on systems composed of hundreds of thousands of atoms, an intensive human-directed effort is still necessary to manage every technical aspect of each individual investigation.  Regardless of the size of the system of interest, researchers must prepare, at a minimum, input data files (e.g., protein data bank, structure, force field parameters) as well as a NAMD-specific configuration appropriate for the target simulation.  Solvation, minimization, equilibration, and the MD simulation itself are all directed, monitored, and analyzed by the researcher and may involve management of a collection of input and output data files that can easily exceed hundreds of gigabytes.  These files may reside locally (on the researcher's "home" system) or may be created on remote supercomputers many thousands of miles away.  While the calculations are ongoing, the researcher is faced with a management problem imposed by the policies of the organization providing the compute facilities.  For example, it is common for high performance computing (HPC) sites to limit (or forbid) interactive use of their supercomputers; instead, users must prepare and submit batch jobs and queue them for execution at a later time when compute resources become available.  Queueing systems commonly differ considerably among sites, requiring the user to become familiar with a system-specific batch command language in order to construct acceptable run scripts for each individual site.  Additionally, sites configure and manage on-line (rotating) disk as well as persistent, long-term storage that act as repositories for programs and data; it is up to the users to learn the specific commands and filesystem layouts that best meet their requirements.

In 2001, NSF launched the "TeraGrid" project, which is coordinated through the University of Chicago and now comprises eight partner sites. The overall goal of the TeraGrid is to provide an open scientific discovery infrastructure combining leadership class resources located at each partner site that are integrated into a persistent computational resource.  As a whole, TeraGrid currently provides over 40 teraflops of computing power, nearly 2 petabytes of rotating storage, and specialized data analysis and visualization resources in production, interconnected at 10-30 gigabits/second via a dedicated national network.

The extensive collection of TeraGrid-provided hardware, however, does not in itself accomplish its stated mission, which is to present state-of-the-art computational resources to researchers not as a disjoint, loosely-connected set of distributed hardware but instead as a coherent, transparent, "virtual machine and resource room" whose facilities are as easily accessible to the scientific researcher as the workstation installed on his or her desktop or the close-by departmental-level computing cluster.

It is the task of what is commonly referred to as "middleware" software and services to help bridge the gap between what are literally standalone computational resources, independently managed and controlled by distinct organizations, and the desired *presentation* of these resources *as if* they were a single entity.  Included in the set of necessary functionalities and services that "Grid"-targeted middleware software should provide are:

  • security and authentication
  • job management (submission and monitoring)
  • data management
  • discovery

Currently, no single standard has been defined that can serve as a common base for implementations of Grid middleware toolkits.  Instead, several offerings from commercial and open source efforts are available and in wide use, with examples including Sun Microsystems' Grid Engine, the Open Middleware Infrastructure Institute (OMII) repository, the Legion project from the University of Virginia, Apple Computer's Xgrid, and the Globus Toolkit from The Globus Alliance. Each of these products shares the common goal of providing a basic set of utilities and services that can be used as building blocks upon which higher-level applications that use distributed resources can be layered.  Within the TeraGrid environment, the Globus Toolkit has been selected as the middleware of choice due to its widespread acceptance as well as the active participation of TeraGrid partners within the Globus Alliance itself.

Regardless of the specific choice of middleware, it is important to recognize that none of the available products or projects provide complete *end-to-end* solutions that are applicable to activities routine in accomplishing domain-specific scientific research.  This is not a failing of their design but simply a consequence of the infinite space of possible activities that can be usefully pursued within distributed computational environments.  Rather, it is more prudent to rely on the knowledge, expertise, and resources within individual communities to define what meaningfully constitutes a "work process", or "workflow" in the context of their research.  It is this follow-on effort, known as NAMD-G, that the TCBG initiated in 2005.

The NAMD-G project is the result of a close collaboration between TCBG scientists and technical staff from NCSA, each contributing distinct areas of expertise to the effort.  The collaboration was initially motivated by exploratory questions such as, "We have heard a great deal about the Grid, but how can we learn how to best make use of this seemingly powerful resource? How can the Grid benefit us in our daily research?"  Further discussions and brainstorming between TCBG and NCSA served to clarify details and goals for each side. For TCBG personnel, the current state of practice within the TeraGrid community was clarified. For NCSA, specific needs and goals regarding computational workflow for the biomolecular science community were provided, and a future direction and vision most relevant to improving the efficiency with which the community conducts its research was outlined.

The initial goals for a NAMD-G prototype that arose from these discussions were to use Grid technologies to:

  1. Take advantage of single-signon authentication and access to TeraGrid computational resources.  It is common for researchers who use multiple supercomputing centers to be faced with the burden of managing multiple logins, passwords, and shell environments.
  2. Provide a transparent interface to high-speed data management and file transfer utilities available within the TeraGrid software stack. Design a strategy that will intelligently place data *where it is needed* - in some cases, this involves moving data from remote supercomputers to mass storage systems, in others it is preferable to directly place results on the researcher's desktop.
  3. Design a workflow model tailored to the way that the biomolecular researcher actually proceeds, offloading the frequent monitoring of simulations and the details of job submission and restarts to the NAMD-G system rather than requiring the scientist to hand-monitor in-progress calculations.

Based on these goals and further directed by the immediate needs of in-progress research being conducted by TCBG staff, the NAMD-G prototype evolved during 2005 with the following specific components chosen to meet the initial requirements:

Authentication: Grid Security Infrastructure (GSI) is a portion of the Globus Toolkit that uses public key cryptography as the basis for secure authentication and communication. GSI extends the standard Secure Sockets Layer (SSL) protocol with a delegation capability based on proxy certificates, which fulfills the single-signon requirement for NAMD-G.  It also enables NAMD-G to authenticate to remote machines on behalf of the user without needing user interaction or storing passwords for the remote machines.  Further, NAMD-G makes use of the NCSA-developed MyProxy credential repository, which provides simplified management of security credentials.  Proxies with shorter lifespans are more secure than longer-term proxies.  However, short-term proxies require the user to acquire new proxies more often. Condor, middleware from the University of Wisconsin-Madison, can use the MyProxy credential repository to securely renew a proxy for a user a short time before the proxy expires without requiring user interaction. NAMD-G takes advantage of this capability of Condor and MyProxy to ensure that the researcher has a valid proxy for the lifetime of the simulation.
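
The renew-shortly-before-expiry policy described above can be illustrated with a minimal Python sketch. It is not Condor's or MyProxy's actual logic; the function name `needs_renewal` and the 30-minute margin are assumptions chosen only for illustration.

```python
from datetime import datetime, timedelta

def needs_renewal(expires_at, now, margin=timedelta(minutes=30)):
    """Return True when the proxy expires within `margin` of `now`.

    Hypothetical helper: the margin is an assumed value, not a
    documented Condor or MyProxy setting.
    """
    return expires_at - now <= margin

now = datetime(2006, 4, 6, 12, 0)
print(needs_renewal(now + timedelta(hours=2), now))     # ample lifetime left
print(needs_renewal(now + timedelta(minutes=10), now))  # inside the renewal window
```

A short proxy lifetime combined with a check of this kind keeps exposure low while sparing the researcher from re-authenticating by hand.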

Data Management: NAMD-G uses the NCSA-developed uberFTP client for file transfers, which also uses GSI authentication and therefore provides the same advantage of limited need for password entry by users.  uberFTP can also automatically wait for files to be transferred from tape to disk, as is the case when retrieving files from NCSA's mass storage system.

Workflow/Job Management: The Condor middleware also includes two components that were chosen for use within NAMD-G to handle remote job management and the ordering of a sequence of jobs.  Remote job submission is managed through Condor-G, which is a "Globus-aware" subset of the more general Condor system that is well-suited for NAMD-G. Part of the requirements of being a TeraGrid system is running Globus services that allow jobs to be submitted remotely.  Condor-G can submit batch jobs to these Globus services and monitor the jobs' status. By using Globus and Condor-G, the researcher (and NAMD-G itself) does not have to spend time learning the syntax of every remote batch system and modifying batch scripts when using different remote machines.  Another component of Condor called DAGMan (Directed Acyclic Graph Manager) provides the functionality used by NAMD-G to automate a sequence of runs that take a simulation from start to finish (such as an equilibration run followed by the simulation proper).
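
As a rough illustration of how a start-to-finish sequence might be expressed for DAGMan, the Python sketch below generates a DAG description for a strictly linear chain of jobs using DAGMan's `JOB` and `PARENT ... CHILD` statements. The stage names and submit-file names are hypothetical, and this is not the format NAMD-G itself emits.

```python
def linear_dag(stages):
    """Build DAGMan input text that runs `stages` strictly in order.

    `stages` is a list of (job_name, submit_file) pairs; each job is
    declared with JOB and chained to its successor with PARENT/CHILD.
    """
    lines = [f"JOB {name} {submit_file}" for name, submit_file in stages]
    for (parent, _), (child, _) in zip(stages, stages[1:]):
        lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines) + "\n"

# Hypothetical stage and file names, for illustration only.
stages = [
    ("minimize", "minimize.sub"),
    ("equilibrate", "equilibrate.sub"),
    ("production", "production.sub"),
]
print(linear_dag(stages), end="")
```

Each submit file named here would be an ordinary Condor-G submit description for one batch job; DAGMan starts a child job only after its parent has completed successfully.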

The workflow capabilities of NAMD-G go beyond simply executing a sequence of runs specified by the researcher.  It is nearly always the case that executing a simulation from start to finish requires substantially more total computing time than is available within limits imposed by the remote batch system. In the absence of a system like NAMD-G, the researcher is forced to periodically check the status of submitted jobs, verify that the simulation is progressing satisfactorily, and restart the simulation from an intermediate checkpointed state.  This process has to be repeated until the total desired simulation time has been reached.

In practice, managing remote batch jobs is an extremely time-consuming activity that amounts to "baby-sitting" a simulation through to completion.  The time spent on these mundane activities is time unavailable for far more useful tasks. The NAMD-G system removes this burden from the researcher and to a great extent automates the entire process from beginning to end, monitoring progress, handling repeated remote batch job submissions, and finally transferring all necessary files both to and from their desired locations.
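
The restart cycle described above amounts to splitting one long simulation into consecutive batch jobs, each resuming from the previous checkpoint. A minimal Python sketch of that bookkeeping follows, assuming the number of timesteps that fits within one batch job's wall-clock limit is known in advance. The function name and the step counts are illustrative and are not taken from NAMD-G.

```python
def plan_segments(total_steps, steps_per_job):
    """Split a run of `total_steps` into (first_step, num_steps) batch jobs.

    Each segment would restart from the checkpoint written at the end
    of the previous one; the last segment may be shorter than the rest.
    """
    if total_steps < 0 or steps_per_job <= 0:
        raise ValueError("total_steps must be >= 0 and steps_per_job > 0")
    segments = []
    first = 0
    while first < total_steps:
        count = min(steps_per_job, total_steps - first)
        segments.append((first, count))
        first += count
    return segments

# Illustrative numbers: 1,000,000 steps, 400,000 steps per batch job.
for first, count in plan_segments(1_000_000, 400_000):
    print(f"restart at step {first}, run {count} steps")
```

In an automated system, each tuple corresponds to one remote batch submission, so the researcher no longer has to resubmit by hand after every job ends.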

In addition to these core features of NAMD-G and its supporting software, attention is paid to implementing features helpful to the researcher using the system.  For example, NAMD-G takes steps to ensure that the researcher is informed about the progress of an ongoing simulation by sending email to notify the user that a portion of a simulation has completed.  The email also provides an excerpt from the standard output of NAMD to help the user stay informed as to the current state of the simulation and to alert the user of possible problems that might require intervention.
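
A notification of this kind could be assembled along the following lines. This is a sketch using Python's standard `email.message` module; the subject wording, the recipient address, and the choice of five trailing lines are assumptions, not details of NAMD-G's implementation.

```python
from email.message import EmailMessage

def progress_email(recipient, stage, namd_stdout, tail_lines=5):
    """Build a message reporting that `stage` finished.

    The body carries the last `tail_lines` lines of NAMD's standard
    output (`tail_lines` must be positive).
    """
    excerpt = "\n".join(namd_stdout.splitlines()[-tail_lines:])
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = f"NAMD-G: stage '{stage}' completed"
    msg.set_content(f"Last {tail_lines} lines of NAMD output:\n\n{excerpt}\n")
    return msg

# Placeholder text standing in for real NAMD output.
sample_output = "line1\nline2\nline3"
msg = progress_email("researcher@example.org", "equilibrate", sample_output, tail_lines=2)
print(msg["Subject"])
print(msg.get_content())
```

Sending the message (for example with `smtplib`) is left out, since mail delivery depends on the local site configuration.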