QoS-driven Storage Management for High-end Computing Systems

Project Details

Project Lead
Ming Zhao 
Project Manager
Ming Zhao 
Institution
Florida International University, School of Computing and Information Sciences  
Discipline
Computer Science (401) 

Abstract

The objective of this project is to address the research challenges of storage management driven by application Quality of Service (QoS) in high-end computing (HEC) systems. The goal is to support the allocation of storage resources on a per-application basis, as well as the efficient execution of applications whose I/O demands vary by several orders of magnitude. The project proposes a new storage management approach that allows application-tailored bandwidth provisioning and optimization by virtualizing parallel file systems (PFSs) and incorporating autonomic capabilities into their management.

Intellectual Merit

In today’s HEC systems, the PFS-based storage infrastructure is typically shared by one or more computing infrastructures, serving application I/O demands in a best-effort manner. As the size of such systems grows, it becomes increasingly common to have many parallel and non-parallel applications running concurrently with distinct I/O characteristics and requirements. However, today’s storage systems are unable to recognize these different application I/O workloads – they only see generic I/O requests arriving from the computing infrastructure. Neither is the storage system capable of satisfying application users’ different I/O performance needs – it is architected to meet the maximum throughput requirement set by the system designers. These limitations prevent applications from efficiently utilizing the HEC resources while achieving their desired performance. They also present a hurdle for the scale-up of HEC systems to include up to millions of processors and to support many large, data-intensive applications.

This project will research and develop novel solutions to address the above challenges, focusing on the following aspects:

1. Per-application I/O bandwidth allocation based on virtualizing contemporary PFSs. Virtual PFSs can be dynamically created upon shared physical storage resources on a per-application basis, where each virtual PFS gets a specific share of the overall I/O bandwidth according to its application’s needs.
2. PFS management services that control the lifecycle and configuration of dynamic per-application virtual PFSs. These services will also support the monitoring of application I/Os and the reservation of storage resources needed for the purpose of improving application performance and resource utilization.
3. Efficient I/O bandwidth allocation through autonomic, fine-grained resource scheduling across applications. With per-application virtual PFSs, the storage resource allocation will incorporate dynamic, coordinated scheduling algorithms as well as optimizations based on application I/O profiling and prediction.
4. Scalable application checkpointing based on virtual PFS optimizations that are specific to the characteristics and demands of checkpoint I/Os.

This research will present a new paradigm for HEC storage resource management. It recognizes the lack of per-application storage bandwidth allocation and QoS management in current systems and proposes novel techniques to address it based on PFS virtualization and autonomic management. This approach is applicable to different PFS-based storage systems and can be readily integrated with existing as well as future HEC systems. It can provide efficient support for applications with distinct I/O access patterns and bandwidth demands, where the proposed techniques for large-scale checkpointing enable much needed optimization for an important type of I/Os in typical HEC systems.
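
The idea of giving each virtual PFS "a specific share of the overall I/O bandwidth according to its application's needs" can be illustrated with a small sketch. The code below is not the project's scheduler. It is a minimal example of weighted max-min fair sharing, one common way to divide a fixed bandwidth budget. The function name `allocate_bandwidth`, the application names, the weights, and the MB/s figures are all hypothetical and chosen only for illustration.

```python
def allocate_bandwidth(total_mbps, apps):
    """Divide total_mbps among applications by weighted max-min fairness.

    apps maps a (hypothetical) application name to a (weight, demand_mbps)
    tuple. An application never receives more than its demand; bandwidth
    it does not need is redistributed to the others in proportion to
    their weights.
    """
    allocation = {name: 0.0 for name in apps}
    # Only applications with a positive weight and demand compete.
    active = {n for n, (w, d) in apps.items() if w > 0 and d > 0}
    remaining = float(total_mbps)

    while active:
        total_weight = sum(apps[n][0] for n in active)
        # Applications whose weighted share covers their full demand.
        capped = {
            n for n in active
            if remaining * apps[n][0] / total_weight >= apps[n][1]
        }
        if not capped:
            # Nobody can be fully satisfied: split what is left by weight.
            for n in active:
                allocation[n] = remaining * apps[n][0] / total_weight
            break
        # Grant capped applications their demand and recycle the surplus.
        for n in capped:
            allocation[n] = float(apps[n][1])
            remaining -= apps[n][1]
        active -= capped

    return allocation


if __name__ == "__main__":
    # Hypothetical workload mix: (weight, demand in MB/s).
    demo = {
        "checkpoint": (2, 800),
        "analytics": (1, 100),
        "visualization": (1, 500),
    }
    for name, mbps in sorted(allocate_bandwidth(1000, demo).items()):
        print(f"{name}: {mbps:.1f} MB/s")
```

In this example the analytics job needs less than its weighted share, so its unused bandwidth is split between the other two applications in a 2:1 ratio. A real system would also have to enforce these shares on the I/O path and adapt them as demands change, which is what aspects 2 and 3 above address.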

Broader Impacts

The proposed approach has the potential to drastically improve the state of the art in I/O management in existing HEC systems, and it will also influence the design of future systems. The PIs will leverage their connections with industry and federal laboratories to establish collaborations on evaluating and fine-tuning the proposed techniques. The prototype system will be deployed in cyberinfrastructures developed by the PIs, and its performance will be quantitatively evaluated with applications from various scientific domains (including hurricane mitigation and bioinformatics). The project results will include implementations based on free software, so the community can flexibly reuse them to meet its own large-scale high-performance computing needs.

Scale of Use

100 to 200 VMs for one to two months.

Results