Evaluation of MPI Collectives for HPC Applications on Distributed Virtualized Environments

Project Details

Project Lead
Ivan Rodero 
Project Manager
Ivan Rodero 
Project Members
Aditya Devarakonda, Mike Puntolillo, Marc Gamell, David Villegas, Gabriel Diaz, Tong Jin, Moustafa AbdelBaky, Georgiana Haldeman, Suhas Aithal  
Supporting Experts
Bingjing Zhang  
Institution
Rutgers University, NSF Center for Autonomic Computing  
Discipline
Computer Science (401) 

Abstract

Natural hazards such as earthquakes, tsunamis, and hurricanes affect every level of society. As a result, there is a critical need for accurate and timely information that enables effective planning for, and response to, potential natural disasters. In this project, our main use case is the Weather Research and Forecasting (WRF) model, a state-of-the-art numerical modeling application developed by the National Center for Atmospheric Research (NCAR) for both operational forecasting and atmospheric research. A measure of its significance and success is that many meteorological services and researchers worldwide are rapidly adopting it. WRF demands a large amount of computational power to produce practical (i.e., accurate and high-resolution) simulations. Because the window of time between the detection of a hurricane and its arrival is relatively small, these simulations need to execute as rapidly as possible, which in turn requires that inter-process communication delay be kept as small as possible. On distributed virtualized environments, however, communication overhead increases because of the virtualization abstraction layer. We intend to explore how MPI collective communication is affected by this abstraction and where the bottlenecks occur. This will help identify which shortcomings need to be addressed before HPC applications such as WRF can be moved to distributed virtualized environments (i.e., Clouds), where scalability and on-demand resource provisioning may prove vital for urgent computing needs.

Intellectual Merit

Cloud computing has grown significantly in recent years; however, more research is necessary to assess the viability of providing HPC as a service in a Cloud. Even with high-speed, gigabit-scale network connectivity and, more recently, InfiniBand, the communication delay on distributed virtualized environments is high in comparison to Grids and Supercomputers [1]. Through our work we will study in depth the requirements for deploying HPC applications in a virtualized environment, complementing existing research on FutureGrid [2] by taking into account key metrics (e.g., throughput and communication latency) and a more detailed analysis of MPI programs. This will help us understand where Clouds stand in relation to Grids and Supercomputers, allowing us to focus our efforts on improving the current Cloud infrastructure so that it performs on par with conventional supercomputing facilities. Our motivation is to develop a new standard for HPC Clouds in which large scalability, high availability, and fault tolerance would give scientists a unique advantage when running simulations. We also intend to involve undergraduate students in this project, introducing them to important topics and open problems in High Performance Computing and helping them make an early impact in the field.

[1] S. Ostermann, A. Iosup, N. Yigitbasi, R. Prodan, T. Fahringer, and D. Epema, "An early performance analysis of cloud computing services for scientific computing," Delft University of Technology, Tech. Rep., December 2008.
[2] A. J. Younge, R. Henschel, J. Brown, G. von Laszewski, J. Qiu, and G. C. Fox, "Analysis of Virtualization Technologies for High Performance Computing Environments," The 4th International Conference on Cloud Computing (IEEE CLOUD 2011), Washington, DC, IEEE, 07/2011.

Broader Impacts

The current Cloud infrastructure suffers from the overhead of running in a virtualized environment. This has lessened the impact of Cloud computing on the high performance computing community and on the scientific community at large. Through our project we will investigate and implement solutions to decrease this overhead and establish Clouds as a viable addition to the Grid and HPC cyber-infrastructures. This would give scientists an additional alternative whose benefits include on-demand provisioning (a problem in conventional Grids/HPC), enhanced fault tolerance, and elasticity (dynamic scale-up and scale-down). These attributes make Cloud computing very appealing; however, computation time remains the main issue. Our project attempts to address this problem and, if successful, should change the way Clouds are viewed within the scientific community. We intend to publish our findings in major journals, workshops, and conferences, and to disseminate the work in a timely manner so that the scientific community can benefit from it. The cost of building and maintaining an HPC facility is potentially much greater than that of using pay-per-use HPC Clouds. This makes Clouds especially attractive, because any institution can use Cloud services to run HPC applications within its budget constraints without relying solely on HPC facilities or Grids, which usually have long waiting periods and usage limits. Because ample resources are available in the form of HPC Clouds, this would extend HPC research to more institutions. We believe that research on HPC Clouds is essential to advancing the current state of the field: positive results could benefit a large sector of the scientific computing community and allow high performance computing to become more pervasive.

Scale of Use

We will need a few VMs to set up the environment and prepare the actual experiments, which may take some weeks. We will then perform a set of runs at different scales, using different systems and different configurations. Analysis will be performed between experiments so that only key scenarios of interest are evaluated. This process may take some months, but resource usage will not be continuous.

Results

We are working on setting up a project web site.

Publications:
[1] D. Villegas, I. Rodero, A. Devarakonda, Y. Liu, N. Bobroff, L. Fong, S.M. Sadjadi, M. Parashar, "Leveraging Cloud Federation in a Layered Service Stack Model", Journal of Computer and System Sciences, Special Issue on Cloud Computing, to appear.