Exploring HPC Fault Tolerance on the Cloud

Project Details

Project Lead
Hui Jin 
Project Manager
Hui Jin 
Project Members
Xi Yang  
Institution
Illinois Institute of Technology, Computer Science Department  
Discipline
Computer Science (401) 
Subdiscipline
11.07 Computer Science 

Abstract

The potential of running traditional High-Performance Computing (HPC) applications on the emerging cloud computing environment has gained intensive attention recently. While most existing studies focus on the performance, scalability, and productivity of HPC applications on the cloud, the reliability issue has rarely been studied in this context. Reliability has already been recognized in the HPC community as a critical challenge limiting scalability, especially for the upcoming exascale computing environment. The cloud environment introduces a more complicated software architecture and exposes the system to a higher risk of failures. Distinct applications running on different virtual machines may share the same physical resources, which causes resource contention and application interference and further threatens reliability. Checkpointing is the most widely used mechanism to support fault tolerance in HPC applications. While checkpointing and its behavior have been well studied on high-end computing (HEC) systems, there has been little study evaluating the impact of the cloud on checkpointing performance. Checkpointing on the cloud may present distinctive features that differentiate it from traditional HEC checkpointing. For example, in the cloud the hard disk of one physical machine is shared by multiple virtual machines, which hurts performance since checkpointing requests are usually issued in bursts. Also, if the checkpointing images go to a dedicated storage service in the cloud, more variation may appear because multiple virtual machines on one node share one NIC, which makes performance difficult to predict. The objective of this project is to study the reliability implications of running HPC applications on the cloud. As the first stage of the project, we will focus on evaluating the performance of parallel checkpointing on the cloud.
On the FutureGrid platform, we will scale our previous experiments to larger application sizes. Performance will be tested on both the local storage of each virtual machine and the shared storage.

Intellectual Merit

This project will evaluate the performance results of parallel checkpointing, observe the bottlenecks, and propose solutions for the potential problems. The project will deliver a paper submission to a refereed conference and be part of the Ph.D. thesis of Hui Jin.

Broader Impacts

This project will present a comprehensive study of the reliability potentials and issues of the emerging cloud computing environment. The project will provide guidelines for designing and deploying HPC applications on the cloud from the perspective of reliability. The project will also propose possible solutions for boosting application performance under failures in the cloud environment.

Scale of Use

We expect to reach scales of at least 2048 VM instances and evaluate the corresponding checkpointing performance. The initial part of the project is planned to last about 3 months, with an estimated total of 409,600 CPU hours (2048 VMs * 4 hrs * 50 runs).

Results

1 Overview
The potential of running traditional High-Performance Computing (HPC) applications on the emerging cloud computing environment has gained intensive attention recently. ScienceCloud [1] is a workshop dedicated to studying the potential of scientific computing in the cloud environment. Furthermore, projects such as FutureGrid and Magellan have been created as testbeds to study the potential of science on the cloud.

While most existing studies focus on the performance, scalability, and productivity of HPC applications on the cloud, the reliability issue has rarely been studied in this context. Reliability has already been recognized in the HPC community as a critical challenge limiting scalability, especially for the upcoming exascale computing environment [2, 3]. The cloud environment introduces a more complicated software architecture and exposes the system to a higher risk of failures. Distinct applications running on different virtual machines may share the same physical resources, which causes resource contention and application interference and further threatens reliability.

Checkpointing is the most widely used mechanism to support fault tolerance in HPC applications. While checkpointing and its behavior have been well studied on high-end computing (HEC) systems, there has been little study evaluating the impact of the cloud on checkpointing performance. Checkpointing on the cloud may present distinctive features that differentiate it from traditional HEC checkpointing. For example, in the cloud the hard disk of one physical machine is shared by multiple virtual machines, which hurts performance since checkpointing requests are usually issued in bursts. Also, if the checkpointing images go to a dedicated storage service in the cloud, more variation may appear because multiple virtual machines on one node share one NIC, which makes performance difficult to predict.
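The trade-off between checkpoint cost and failure risk that motivates this study is often estimated with Young's first-order approximation for the optimal checkpoint interval. As a rough illustration (the checkpoint cost and MTBF figures below are hypothetical, not measurements from our experiments):

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation: T_opt = sqrt(2 * C * MTBF),
    where C is the cost of taking one checkpoint (seconds)
    and MTBF is the mean time between failures (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Hypothetical figures: a 600 s checkpoint on a system with a 24 h MTBF.
interval = optimal_checkpoint_interval(600, 24 * 3600)
print(f"optimal interval: {interval / 3600:.2f} hours")  # ~2.83 hours
```

The point for the cloud setting is that contention can inflate the checkpoint cost C, which in turn lengthens the optimal interval and increases the expected recomputation lost per failure.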

2 Proposed Work
The objective of this project is to study the reliability implications of running HPC applications on the cloud. As the first stage of the project, we will focus on evaluating the performance of parallel checkpointing on the cloud.

The project requires software that supports parallel system-level checkpointing. More specifically, we will build the experiments on Open MPI [4] and BLCR [5]. Open MPI is an MPI-2 implementation that fully supports system-level checkpointing. The lower-level component that implements checkpointing in Open MPI is BLCR, a system-level checkpoint/restart facility for Linux. PVFS2 [6] is also required to evaluate the potential of checkpointing on parallel file systems. The experiments will study checkpointing performance on classical MPI benchmarks such as NPB [7] and SPEC MPI2007 [8]. We are also planning to test classical scientific computing kernels such as matrix multiplication on FutureGrid.
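As a sketch of how such a checkpointed run might be driven: in the Open MPI checkpoint/restart framework of this era, the `-am ft-enable-cr` AMCA parameter activates BLCR-backed checkpointing, and `ompi-checkpoint` requests a global checkpoint of a running job via its `mpirun` PID. The benchmark binary name below is illustrative:

```python
import subprocess

def build_mpirun_cmd(np, binary, enable_cr=True):
    """Build an Open MPI launch command; -am ft-enable-cr activates
    the checkpoint/restart framework backed by BLCR."""
    cmd = ["mpirun", "-np", str(np)]
    if enable_cr:
        cmd += ["-am", "ft-enable-cr"]
    cmd.append(binary)
    return cmd

def request_checkpoint(mpirun_pid):
    """Ask Open MPI to checkpoint a running job (identified by the
    PID of its mpirun process); requires the ompi-checkpoint tool."""
    return subprocess.run(["ompi-checkpoint", str(mpirun_pid)])

# Illustrative launch of a NAS Parallel Benchmark binary on 64 processes:
print(" ".join(build_mpirun_cmd(64, "./bt.C.64")))
```

Timing how long `ompi-checkpoint` takes to return, across different process counts and storage targets, is one simple way to collect the checkpoint-overhead data the experiments call for.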

The Scalable Computing Software (SCS) [9] Lab at Illinois Institute of Technology (IIT) has accumulated extensive expertise in HPC reliability research. Funded by NSF, the FENCE [10] project has been conducted successfully at SCS. The objective of FENCE is to build a Fault awareness ENabled Computing Environment for high-performance computing. As part of that project, we have conducted research on the performance evaluation and optimization of checkpointing in HPC [11, 12, 13]. The experiments were based on the Sun ComputeFarm [14] of the SCS lab and evaluated the performance of parallel checkpointing in a cluster environment. The ComputeFarm is composed of 64 Sun Fire X2200 servers. Each node is equipped with 2.7GHz quad-core Opteron processors, 8GB of memory, and a 250GB SATA hard drive. All nodes are connected by 1 Gigabit Ethernet NICs in a fat-tree topology. We have carried out experiments to examine checkpointing performance for the NPB benchmarks. Different benchmarks present different checkpointing overheads on our system. We have observed decreasing performance as the application scales up, which implies the scalability limitations of parallel checkpointing in a cluster.
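The shared-disk effect we expect on the cloud can be captured by a simple first-order model, assuming the physical disk bandwidth is divided evenly among co-located VMs that checkpoint simultaneously (the image size and bandwidth figures are purely illustrative):

```python
def checkpoint_write_time(image_gb, disk_mb_s, vms_per_node):
    """Seconds to flush one checkpoint image when vms_per_node VMs
    write bursts to the same physical disk at the same time."""
    effective_bw = disk_mb_s / vms_per_node  # even-sharing assumption
    return image_gb * 1024 / effective_bw

# A 4 GB image on a 100 MB/s disk: one VM alone vs. 4 VMs contending.
print(checkpoint_write_time(4, 100, 1))  # 40.96 s
print(checkpoint_write_time(4, 100, 4))  # 163.84 s
```

Even this crude model shows checkpoint latency growing linearly with VM consolidation; the planned experiments will measure how far the real behavior (with caching, scheduling, and network storage) departs from it.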

On the FutureGrid platform, we will scale our previous experiments to larger application sizes, potentially to the entire system if possible. Performance will be tested on both the local storage of each virtual machine and the shared storage. We expect to reach scales of at least 2048 VM instances and evaluate the corresponding checkpointing performance. The initial part of the project is planned to last about 3 months, with an estimated total of 409,600 CPU hours (2048 VMs * 4 hrs * 50 runs).
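The CPU-hour estimate above is straightforward to reproduce:

```python
vms, hours_per_run, runs = 2048, 4, 50
total_cpu_hours = vms * hours_per_run * runs
print(total_cpu_hours)  # 409600
```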

This project will evaluate the performance results of parallel checkpointing, observe the bottlenecks, and propose solutions for the potential problems. The project will deliver a paper submission to a refereed conference (e.g., Supercomputing, with an April 2011 deadline, should be realistic) and be part of the Ph.D. thesis of Hui Jin.

[1] 1st ACM Workshop on Scientific Cloud Computing (ScienceCloud 2010), http://dsl.cs.uchicago.edu/ScienceCloud2010/
[2] N. DeBardeleben, J. Laros, J. T. Daly, S. L. Scott, C. Engelmann, and B. Harrod, High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path-Forward for Research and Development. White paper, 2009. Online: http://www.csm.ornl.gov/~engelman/publications/debardeleben09high-end.pdf
[3] F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer, and M. Snir, Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol. 23, no. 4, pp. 374-388, 2009.
[4] Open MPI Website, http://www.open-mpi.org/
[5] BLCR Website, https://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml
[6] PVFS2 Website, http://www.pvfs.org
[7] NAS Parallel Benchmarks, http://www.nas.nasa.gov/Resources/Software/npb.html
[8] MPI2007 Website, http://www.spec.org/mpi2007/
[9] SCS Website, http://www.cs.iit.edu/~scs
[10] FENCE Project Website, http://www.cs.iit.edu/~zlan/fence.html
[11] H. Jin, Checkpointing Orchestration for Performance Improvement, DSN 2010 (Student Forum).
[12] H. Jin, Y. Chen, T. Ke and X.-H. Sun, REMEM: REmote MEMory as Checkpointing Storage, CloudCom 2010.
[13] H. Jin, T. Ke, Y. Chen and X.-H. Sun, Checkpointing Orchestration: Towards Scalable HPC Fault Tolerance. IIT Technical Report, Sep 2010.
[14] Sun ComputeFarm at SCS Website: http://www.cs.iit.edu/~scs/research/resources.html