Optimizing Shared Resource Contention in HPC Clusters

Project Details

Project Lead
Sergey Blagodurov 
Project Manager
Sergey Blagodurov 
Supporting Experts
Gregor von Laszewski  
Institution
Simon Fraser University, School of Computing Science  
Discipline
Computer Science (401) 

Abstract

Contention for shared resources in HPC clusters occurs when jobs execute concurrently on the same multicore node, where they compete for allocated CPU time, shared caches, the memory bus, memory controllers and so on, and when jobs concurrently access the cluster interconnect as their processes exchange data. In a virtualized environment, the cluster scheduler also has to use the cluster network to migrate job virtual machines across nodes. We argue that contention for shared cluster resources severely degrades workload performance and stability and hence must be addressed. We also found that state-of-the-art HPC cluster schedulers are not contention-aware. The goal of this work is the design, implementation and evaluation of a scheduling framework that optimizes shared resource contention in a virtualized HPC cluster environment.

Intellectual Merit

The proposed research demonstrates how shared resource contention in HPC clusters can be addressed via contention-aware scheduling of HPC jobs. The proposed framework comprises a novel scheduling algorithm and a set of open-source software that includes original code and patches to widely used tools in the field. The solution (a) allows online monitoring of the cluster workload and (b) provides a way to make and enforce contention-aware scheduling decisions in practice.
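
To make the idea of a contention-aware scheduling decision concrete, here is a minimal sketch. It is not the Clavis-HPC algorithm, which this page does not describe. It assumes each job has a single hypothetical "intensity" score (for example, a measured cache miss rate obtained from online monitoring) and uses a simple greedy heuristic that spreads the most intensive jobs across nodes. The names Node, place_jobs and the scores are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    slots: int  # how many jobs the node can host at once (assumed fixed)
    jobs: list = field(default_factory=list)

    @property
    def load(self) -> float:
        # Sum of the intensity scores of the jobs already placed on this node.
        return sum(score for _, score in self.jobs)


def place_jobs(job_scores: dict, nodes: list) -> dict:
    """Greedily assign jobs to nodes so that summed intensity stays balanced.

    job_scores maps a job name to a hypothetical contention score;
    a higher score means the job stresses shared resources more.
    """
    placement = {}
    # Place the most intensive jobs first so they end up on different nodes.
    ordered = sorted(job_scores.items(), key=lambda kv: kv[1], reverse=True)
    for job, score in ordered:
        candidates = [n for n in nodes if len(n.jobs) < n.slots]
        if not candidates:
            raise RuntimeError("not enough free slots for job " + job)
        # Pick the node with the lowest accumulated intensity.
        target = min(candidates, key=lambda n: n.load)
        target.jobs.append((job, score))
        placement[job] = target.name
    return placement


if __name__ == "__main__":
    cluster = [Node("node0", slots=2), Node("node1", slots=2)]
    scores = {"A": 30.0, "B": 25.0, "C": 2.0, "D": 1.0}
    print(place_jobs(scores, cluster))
```

In this example the two intensive jobs, A and B, land on different nodes, and each is paired with a light job. A real framework would also have to weigh communication over the interconnect and the cost of migrating virtual machines, which this sketch ignores.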

Broader Impacts

This research suggests a way to upgrade the HPC infrastructure used by U.S. academic institutions, industry and government. The goal of the upgrade is better performance for general cluster workloads.

Scale of Use

Book 16-64 nodes from any FutureGrid resource that supports HPC (e.g., india or alamo) in exclusive mode for several hours.

Results

Accepted publications:

Tyler Dwyer, Alexandra Fedorova, Sergey Blagodurov, Mark Roth, Fabien Gaud and Jian Pei, A Practical Method for Estimating Performance Degradation on Multicore Processors and its Application to HPC Workloads, in Supercomputing Conference (SC), 2012. Acceptance rate 21%.
MAS rank: 51/2872 (top 2%)
http://www.sfu.ca/~sba70/files/sc12.pdf

Presented posters:

Sergey Blagodurov, Alexandra Fedorova, Fabien Hermenier, Clavis-HPC: a Multi-Objective Virtualized Scheduling Framework for HPC Clusters, in OSDI 2012.

Public software releases:

Clavis-HPC: a multi-objective virtualized scheduling framework for HPC clusters.
http://hpc-sched.cs.sfu.ca/

The source code is available for download from the GitHub repository:
https://github.com/blagodurov/clavis-hpc

Documentation:

Below is a link to our project report for the FutureGrid Project Challenge. A shorter version of it will appear in the HPCS 2012 proceedings as a Work-In-Progress paper:
http://www.sfu.ca/~sba70/files/report188.pdf

A very brief outline of the problem, the framework and some preliminary results:
http://www.sfu.ca/~sba70/files/ClusterScheduling.pdf