Managing an Adaptive Cloud Cache for Supporting Data-Intensive Applications

Project Details

Project Lead
David Chiu 
Project Manager
David Chiu 
Supporting Experts
Saliya Ekanayake  
Institution
Washington State University, School of Engineering and Computer Science  
Discipline
Computer Science (401) 

Abstract

Today's computing projects, including myriad data mining, analysis, and scientific applications, are becoming increasingly data-intensive. Despite steady advances in hardware and computing paradigms, this so-called data deluge continues to extend processing times. For example, in popular scenarios involving parallel and distributed frameworks, large data transfers must be invoked among machines to communicate or merge results. Meanwhile, the emergence of Cloud computing has been apropos for addressing this problem: processing times can be expedited by harnessing the Cloud's resource elasticity, i.e., the on-demand nature of virtual machine allocation and ostensibly infinite storage, for a price. Many data-intensive applications, including analysis and scientific workflows, are known to contain frequent redundant overlaps in their computation patterns. This implies that many applications can benefit from caching intermediate and final computed results for reuse. By avoiding repeated heavy computation and amortizing the data movement inherent to those processes, data-intensive applications can be accelerated. However, managing such a data cache in the Cloud poses several challenges. For instance, the structure of the physical storage hierarchy (machine memory, local and network disks, persistent storage) should adapt to an application's performance requirements. Furthermore, data placement policies and heuristics for resource consolidation must also be developed to optimize an application's performance and cost effectiveness. We propose developing an elastic Cloud cache manager, which we aim to release to the public as an open source project, that addresses these challenges.
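The reuse idea above can be illustrated with a minimal sketch: a cache keyed by a hash of an operation and its parameters, so that a repeated request returns the stored result instead of recomputing. All names here (`ResultCache`, `get_or_compute`) are hypothetical illustrations, not the proposed system's API, and the real manager would store results across Cloud storage tiers rather than in local memory.

```python
import hashlib
import json

class ResultCache:
    """Illustrative in-memory cache of computed results, keyed by a
    stable hash of the operation name and its parameters."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, op, params):
        # Canonical JSON so identical requests hash identically.
        payload = json.dumps({"op": op, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_compute(self, op, params, compute):
        key = self._key(op, params)
        if key in self._store:
            self.hits += 1          # redundant computation avoided
            return self._store[key]
        self.misses += 1
        value = compute()           # heavy computation runs only once
        self._store[key] = value
        return value

cache = ResultCache()
# The first request computes; an identical second request reuses the result.
a = cache.get_or_compute("mean", {"dataset": "d1"}, lambda: 42.0)
b = cache.get_or_compute("mean", {"dataset": "d1"}, lambda: 42.0)
```

In a real deployment the stored value would be a data block in the Cloud storage hierarchy, and eviction would weigh both access patterns and monetary cost.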

Intellectual Merit

Our proposed cache would provide a 2-tiered system, capable of (1) predicting and managing the costs of provisioning Cloud resources and (2) adaptively managing cached data within the provisioned resources by promoting and demoting data blocks in the storage hierarchy to optimize performance.
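The promotion/demotion mechanism in point (2) can be sketched as follows, assuming a three-tier hierarchy ordered fastest to slowest. The tier names, thresholds, and class names are hypothetical placeholders; the proposed manager would additionally factor provisioning cost into placement decisions.

```python
# Hypothetical tiers, fastest (index 0) to slowest.
TIERS = ["memory", "local_disk", "persistent_store"]

class TieredCache:
    """Sketch of frequency-based promotion and demotion of data
    blocks across a storage hierarchy."""

    def __init__(self, promote_threshold=3):
        self.tier_of = {}         # block id -> tier index
        self.access_count = {}    # block id -> accesses since last move
        self.promote_threshold = promote_threshold

    def insert(self, block_id):
        # New blocks land in the slowest (cheapest) tier.
        self.tier_of[block_id] = len(TIERS) - 1
        self.access_count[block_id] = 0

    def access(self, block_id):
        # Hot blocks are promoted one tier toward memory.
        self.access_count[block_id] += 1
        if (self.access_count[block_id] >= self.promote_threshold
                and self.tier_of[block_id] > 0):
            self.tier_of[block_id] -= 1
            self.access_count[block_id] = 0
        return TIERS[self.tier_of[block_id]]

    def demote_cold(self):
        # Blocks untouched since the last sweep drift downward.
        for block_id, count in self.access_count.items():
            if count == 0 and self.tier_of[block_id] < len(TIERS) - 1:
                self.tier_of[block_id] += 1

cache = TieredCache()
cache.insert("b1")
for _ in range(3):
    tier = cache.access("b1")
```

Here the third access crosses the threshold and promotes the block from `persistent_store` to `local_disk`; a periodic `demote_cold` sweep pushes idle blocks back down, freeing the faster tiers.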

Broader Impacts

A cost-conscious cache would be useful to multiple stakeholders, including helping accelerate scientific applications and general service-oriented applications.

Scale of Use

Several (10 to 20) VMs for experiments and courses.