IBM Books

IBM LoadLeveler for AIX 5L: Using and Administering


Table of Contents

  • Who Should Use This Book

  • How this Book is Organized
  • Typographic Conventions
  • Related Information
  • Information Formats
  • Accessing This Book off the World Wide Web
  • Accessing LoadLeveler Documentation Online
  • LoadLeveler Man Pages
  • What's New in 3.1
  • Book reorganization
  • Submitting jobs that use striping
  • Integration with AIX Workload Manager
  • Gang scheduling
  • Checkpoint/Restart
  • Support for 64-bit applications
  • File system monitoring
  • Migration Considerations
  • Moving From 2.1 to 2.2
  • Keyword Added to Administration File
  • Changes in LoadLeveler Command Output
  • Moving from 2.2 to 3.1
  • Interaction with AIX Workload Manager
  • Checkpoint considerations
  • Gang scheduling considerations
  • LoadLeveler 2.2 and 3.1 coexistence

  • LoadLeveler overview

  • Overview summary

  • What is LoadLeveler?
  • LoadLeveler basics
  • How LoadLeveler works
  • Network job management and job scheduling systems
  • How LoadLeveler schedules jobs
  • LoadLeveler daemons
  • The LoadLeveler job cycle
  • Consumable resources

  • LoadLeveler interfaces

  • Interface summary

  • LoadLeveler command line interface
  • Summary of LoadLeveler commands
  • Using the Graphical User Interface
  • Starting the Graphical User Interface
  • Specifying options
  • The LoadLeveler main window
  • Getting help using the Graphical User Interface
  • Differences between LoadLeveler's Graphical User Interface and other Graphical User Interfaces
  • Graphical User Interface typographic conventions
  • Customizing the Graphical User Interface
  • Syntax of an Xloadl file
  • Modifying windows and buttons
  • Creating your own pull-down menus
  • Customizing fields on the Jobs window and the Machines window
  • Modifying help panels
  • Administrative uses for the Graphical User Interface
  • Job related administrative actions
  • Machine related administrative actions
  • LoadLeveler API interface
  • Summary of LoadLeveler APIs

  • User tasks

  • User task summary

  • Submitting and managing jobs
  • Building a job command file
  • Job command file syntax
  • Submitting a job command file
  • Managing jobs
  • Editing job command files
  • Querying the status of a job
  • Placing and releasing a hold on a job
  • Cancelling a job
  • Checkpointing a job
  • Setting and changing the priority of a job
  • Working with machines
  • Run-time environment variables
  • Managing jobs that consume resources
  • Specifying the consumption of resources by a job step
  • Displaying currently available resources
  • Special considerations for parallel jobs
  • Supported parallel environments
  • Keyword considerations for parallel jobs
  • Scheduler considerations
  • Task assignment considerations
  • Submitting jobs that use striping
  • Understanding striping
  • Using striping
  • Running interactive POE jobs
  • Job command file examples
  • Obtaining status of parallel jobs
  • Obtaining allocated host names

  • Administrator tasks

  • Administrator task summary

  • Administering and configuring LoadLeveler
  • Overview
  • Planning considerations
  • Where to begin?
  • Quick set up
  • Administering LoadLeveler
  • Administration file structure and syntax
  • Configuring LoadLeveler
  • The configuration files
  • Configuration file structure and syntax
  • Considerations for integrating LoadLeveler with AIX Workload Manager
  • Keyword summary
  • Administration tasks for parallel jobs
  • Scheduling considerations for parallel jobs
  • Allowing users to submit interactive POE jobs
  • Allowing users to submit PVM jobs
  • Restrictions and limitations for PVM jobs
  • Setting up a class for parallel jobs
  • Setting up a parallel master node
  • Gathering job accounting data
  • Collecting job resource data on serial and parallel jobs
  • Collecting job resource data based on machines
  • Collecting job resource data based on events
  • Collecting job resource information based on user accounts
  • Collecting the accounting information and storing it into files
  • Accounting reports
  • Job accounting setup procedure
  • Routing jobs to NQS machines
  • Setting up the NQS environment
  • Designating machines to which jobs will be routed
  • NQS scripts
  • NQS machine job routing procedure

  • Detailed descriptions

  • Descriptions summary

  • Job command file keywords
  • account_no
  • arguments
  • blocking
  • checkpoint
  • ckpt_dir
  • ckpt_file
  • ckpt_time_limit
  • class
  • comment
  • core_limit
  • cpu_limit
  • data_limit
  • dependency
  • environment
  • error
  • executable
  • file_limit
  • group
  • hold
  • image_size
  • initialdir
  • input
  • job_cpu_limit
  • job_name
  • job_type
  • max_processors
  • min_processors
  • network
  • node
  • node_usage
  • notification
  • notify_user
  • output
  • parallel_path
  • preferences
  • queue
  • requirements
  • resources
  • restart
  • restart_from_ckpt
  • restart_on_same_nodes
  • rss_limit
  • shell
  • stack_limit
  • startdate
  • step_name
  • task_geometry
  • tasks_per_node
  • total_tasks
  • user_priority
  • wall_clock_limit
  • Job command file variables
  • Example 1
  • Example 2
  • Administration and Configuration file keywords
  • Administration file keywords
  • Configuration file keywords and LoadLeveler variables
  • Keywords
  • User-defined keywords
  • LoadLeveler variables
  • LoadLeveler daemons and job states
  • Daemons
  • The master daemon
  • The schedd daemon
  • The startd daemon
  • The negotiator daemon
  • The kbdd daemon
  • The gsmonitor daemon
  • Job states
  • Commands
  • llacctmrg - Collect machine history files
  • llcancel - Cancel a submitted job
  • llckpt - Checkpoint a running job step
  • llclass - Query class information
  • llctl - Control LoadLeveler daemons
  • lldcegrpmaint - LoadLeveler DCE group maintenance utility
  • llextSDR - Extract adapter information from the SDR
  • llfavorjob - Reorder system queue by job
  • llfavoruser - Reorder system queue by user
  • llhold - Hold or release a submitted job
  • llinit - Initialize machines in the LoadLeveler cluster
  • llmatrix - Query Gang matrix
  • llmodify - Change attributes of a submitted job step
  • llpreempt - Preempt a submitted job step
  • llprio - Change the user priority of submitted job steps
  • llq - Query job status
  • llstatus - Query machine status
  • llsubmit - Submit a job
  • llsummary - Return job resource information for accounting
  • Application Programming Interfaces (APIs)
  • Accounting API
  • Account validation user exit
  • Report generation subroutine
  • Checkpointing API
  • ckpt subroutine
  • ll_init_ckpt
  • ll_ckpt
  • ll_set_ckpt_callbacks
  • ll_unset_ckpt_callbacks
  • Data Access API
  • Using the data access API
  • ll_query subroutine
  • ll_set_request subroutine
  • ll_reset_request subroutine
  • ll_get_objs subroutine
  • Understanding the LoadLeveler job object model
  • ll_get_data subroutine
  • ll_next_obj subroutine
  • ll_free_objs subroutine
  • ll_deallocate subroutine
  • Examples of using the Data Access API
  • Error Handling API
  • ll_error subroutine
  • Parallel Job API
  • Interaction between LoadLeveler and the parallel API
  • ll_get_hostlist subroutine
  • ll_start_host subroutine
  • Examples
  • Query API
  • ll_get_jobs subroutine
  • ll_free_jobs subroutine
  • ll_get_nodes subroutine
  • ll_free_nodes subroutine
  • Submit API
  • llsubmit subroutine
  • llfree_job_info subroutine
  • Monitoring programs
  • Workload Management API
  • ll_control subroutine
  • ll_modify subroutine
  • ll_preempt subroutine
  • ll_start_job subroutine
  • ll_terminate_job subroutine
  • Usage notes
  • User exits
  • Handling DCE security credentials
  • Handling an AFS token
  • Filtering a job script
  • Using your own mail program
  • Writing prolog and epilog programs
  • Procedures
  • Using the Graphical User Interface
  • Step 1: Building jobs
  • Step 2: Edit the job command file
  • Step 3: Submit a job command file
  • Step 4: Display, refresh, and obtain job status
  • Step 5: Sort the Jobs window
  • Step 6: Change priorities of jobs in a queue
  • Step 7: Hold a job
  • Step 8: Release a hold on a job
  • Step 9: Cancel a job
  • Step 10: Modify consumable CPUs and consumable memory
  • Step 11: Take checkpoint
  • Step 12: Display and refresh machine status
  • Step 13: Sort the Machines window
  • Step 14: Find the location of the central manager
  • Step 15: Find the location of the public scheduling machines
  • Step 16: Find the type of scheduler in use
  • Step 17: Specify which jobs appear in the Jobs window
  • Step 18: Specify which machines appear in Machines window
  • Step 19: Save LoadLeveler messages in a file
  • Customizing the administration file
  • Step 1: Specify machine stanzas
  • Step 2: Specify user stanzas
  • Step 3: Specify class stanzas
  • Step 4: Specify group stanzas
  • Step 5: Specify adapter stanzas
  • Customizing the global and local configuration file
  • Step 1: Define LoadLeveler administrators
  • Step 2: Define LoadLeveler cluster characteristics
  • Step 3: Define LoadLeveler machine characteristics
  • Step 4: Define consumable resources
  • Step 5: Specify how many jobs a machine can run
  • Step 6: Prioritize the queue maintained by the negotiator
  • Step 7: Prioritize the order of executing machines maintained by the negotiator
  • Step 8: Manage a job's status using control expressions
  • Step 9: Define job accounting
  • Step 10: Specify alternate central managers
  • Step 11: Specify where files and directories are located
  • Step 12: Record and control log files
  • Step 13: Define network characteristics
  • Step 14: Enable checkpointing
  • Planning considerations for checkpointing jobs
  • How to checkpoint a job
  • Remove old checkpoint files
  • Step 15: Specify process tracking
  • Step 16: Configuring LoadLeveler to use DCE security services
  • Step 17: Specify additional configuration file keywords
  • Setting up job accounting files
  • Task 1: Update the configuration file
  • Task 2: Merge multiple files collected from each machine into one file
  • Task 3: Report job information on all the jobs in the history file
  • Task 4: Using account numbers and setting up account validation
  • Task 5: Specifying machines and their weights
  • Routing jobs to NQS machines
  • Task 1: Modify the administration file
  • Task 2: Modify the configuration file
  • Task 3: Submit the jobs
  • Task 4: Obtain status of NQS jobs
  • Task 5: Cancel NQS jobs
  • Using Gang scheduling
  • Overview
  • Gang scheduling concepts
  • Hierarchical communication
  • Task switching
  • Supported hardware
  • Application support
  • Preemption
  • Keywords specific to Gang scheduling
  • Configuration file keywords for Gang scheduling
  • Sample configuration file
  • Administration file keywords for Gang
  • Sample administration file
  • Gang scheduling interactions and restrictions
  • Network Time Protocol (NTP)
  • Consumable resource enforcement
  • Reconfiguration
  • Circular preemption
  • Restrictions for Gang scheduling and preemption
  • Implied START_CLASS values
  • Last one wins rule
  • Job command file and Gang scheduling
  • LoadLeveler commands for Gang
  • APIs used with Gang scheduling
  • Support for 64-bit applications
  • 64-bit support for Job Command, Configuration, and Administration keywords
  • 64-bit support for Job Command file keywords
  • 64-bit support for Administration keywords
  • 64-bit support for Configuration keywords and expressions
  • 64-bit support for Command line interfaces and the GUI
  • 64-bit support for Command line interfaces
  • 64-bit support for the GUI
  • 64-bit support for the LoadLeveler APIs
  • 64-bit support for Accounting functions
  • Appendix contents

  • Appendixes

  • Appendix A. Examples
  • User tasks: building job command files
  • Using commands
  • Additional examples of building job command files
  • User tasks: building parallel job command files
  • POE
  • PVM 3.3 (non-SP)
  • PVM 3.3.11+ (SP2MPI architecture)
  • Appendix B. Customer case studies
  • Customer 1: technical computing at the Cornell Theory Center
  • System configuration
  • LoadLeveler configuration
  • Customer 2: circuit simulation
  • System configuration
  • LoadLeveler configuration
  • Customer 3: high-energy physics
  • System configuration
  • LoadLeveler batch configuration
  • LoadLeveler interactive configuration
  • Processor configuration
  • Customer 4: computer chip design
  • System configuration
  • Interactive configuration
  • Batch configuration
  • Configuration for a machine that schedules (but doesn't run) jobs
  • Appendix C. Troubleshooting
  • Troubleshooting LoadLeveler
  • Frequently Asked Questions
  • Helpful hints
  • Getting help from IBM
  • Appendix D. Bibliography
  • Information formats
  • Finding documentation on the World Wide Web
  • Accessing PSSP documentation online
  • Manual pages for public code
  • RS/6000 SP planning publications
  • RS/6000 SP hardware publications
  • RS/6000 SP Switch Router publications
  • Related hardware publications
  • RS/6000 SP software publications
  • AIX publications
  • DCE publications
  • Redbooks
  • Non-IBM publications
  • Appendix E. Notices
  • Trademarks and service marks
  • Appendix F. Glossary
  • Index
