HPC Scheduling

Project ID
FG-116
Project Categories
Computer Science
Completed
Abstract

Catalina is an open source external scheduler for use with resource managers such as LoadLeveler, PBS, and SLURM. Its capabilities include system reservations, user reservations, user-settable reservations, standing reservations, priority-ordered queueing, and scheduling policies. Grid Universal Remote (GUR) and Master Control Program (MCP) are open source metascheduling tools that are currently used in TeraGrid.

Catalina has been used in production for almost a decade for local scheduling on NSF supercomputers (Blue Horizon, DataStar, SDSC IA64). Extending Catalina with topology-aware scheduling, to accommodate network topologies (3D torus, with leaf nodes), would greatly benefit the scheduling of workloads on 3D-torus machines.

GUR is a tool for conducting multi-site, coordinated calculations. It has been used to create synchronized reservations across separate compute clusters to run single MPI jobs. Originally, GUR could stage data and remotely compile code; these capabilities are being revamped. GUR requires user-settable reservations on the local scheduler, a feature that both Moab and the Catalina scheduler provide. If virtualization is available, we would use it to bring up virtual Catalina clusters for testing and development. http://www.teragrid.org/userinfo/jobs/gur.php

MCP is a command-line utility that provides automatic resource selection for running a single parallel job on high performance computing resources. MCP optimizes job start times by submitting copies of a job and canceling the extra job requests after one copy begins to execute. We would use FutureGrid to explore using MCP for ensemble runs. http://www.teragrid.org/userinfo/jobs/mcp.php

We would like to use FutureGrid as a test bed to further develop Catalina, GUR, and MCP, with access to a wide variety of experimental and production architectures.

Software requirements include Python 2.2 or greater (2.6.4 is the current version), Expect, a C compiler, and OpenSSH for communication. It would be nice to have the Remote Login kit from CTSS, but we can get by with OpenSSH instead.
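
The MCP strategy described above (submit copies of a job to several resources, then cancel the extras once one copy starts) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not MCP's actual implementation: the `SimulatedSite` class, the `run_first_to_start` function, the site names, and the fixed queue waits are all hypothetical, and the code is written for a modern Python 3 interpreter rather than the Python versions listed in the requirements.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedSite:
    """Hypothetical stand-in for one resource's batch queue (not a real MCP API)."""
    name: str
    queue_wait: int  # simulated time steps until a submitted job starts
    jobs: dict = field(default_factory=dict)  # job_id -> state string

    def submit(self, job_id):
        self.jobs[job_id] = "queued"

    def cancel(self, job_id):
        # Only a job that is still waiting in the queue can be cancelled.
        if self.jobs.get(job_id) == "queued":
            self.jobs[job_id] = "cancelled"

    def tick(self, now, job_id):
        """Advance to time `now`; the job starts once its wait has elapsed."""
        if self.jobs.get(job_id) == "queued" and now >= self.queue_wait:
            self.jobs[job_id] = "running"
        return self.jobs.get(job_id)


def run_first_to_start(sites, job_id, max_steps=1000):
    """Submit one copy per site, keep the first copy to start, cancel the rest."""
    for site in sites:
        site.submit(job_id)
    for now in range(max_steps + 1):
        for site in sites:
            if site.tick(now, job_id) == "running":
                for other in sites:
                    if other is not site:
                        other.cancel(job_id)
                return site.name, now
    return None, None  # no copy started within max_steps


if __name__ == "__main__":
    # Assumed queue waits; real waits are unknown in advance, which is
    # why copies are submitted everywhere instead of picking one site.
    sites = [
        SimulatedSite("site-a", queue_wait=5),
        SimulatedSite("site-b", queue_wait=2),
        SimulatedSite("site-c", queue_wait=9),
    ]
    winner, start_time = run_first_to_start(sites, "job-1")
    print(winner, start_time)
```

In this sketch the copy on the site with the shortest simulated wait starts first and the other two copies are cancelled while still queued. A real deployment would also have to handle the case where two copies start almost simultaneously, which this simulation avoids by polling sites in a fixed order.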

Use of FutureSystems
<p>FutureGrid resources will be used for development and testing of scheduling and metascheduling software. FutureGrid VMs will serve as scheduling targets.</p>
Scale of Use
<p>A moderate to high number of very small footprint VMs to simulate an HPC cluster. (I'm not actually going to run work on the VMs, so they don't need much CPU time or memory.)</p>