Data Semantics Aware Clouds for High Performance Analytics

Project ID
Project Categories
Computer Science
NSF Grant Number
Today's cutting-edge research deals with the increasing volume and complexity of data produced by ultra-scale simulations, high-resolution scientific equipment, and experiments. Representative examples include analytics- and simulation-driven applications such as astrophysics data analysis and bioinformatics. In these fields, scientists process and analyze large amounts of data to explore new concepts and ideas. Many scientists are exploring the possibility of deploying data-intensive applications on cloud computing platforms such as Amazon EC2 and Windows Azure. The recent successful deployment of eScience applications on clouds motivates us to deploy HPC analytics applications, especially MapReduce-enabled ones, to the cloud as well. The reason is that eScience applications and HPC analytics applications share some important features: terascale or petascale data sizes and a high cost of running on one or several supercomputers or large platforms. However, HPC analytics applications have some distinct characteristics, such as complex data access patterns and interest locality, which pose new challenges to their adoption of clouds. Current solutions do not deal well with these challenges and have several limitations. This project develops a data-semantics-aware framework to support HPC analytics applications on clouds.
Use of FutureSystems
Develop, debug, test, and evaluate our proposed data-semantics-aware software systems and tools.
Scale of Use
We need a stable testing environment in which we can set up a Hadoop cluster testbed of up to 128 or 256 nodes. We plan to run sensitivity tests on Hadoop clusters ranging from 16 to 256 nodes. The storage capacity should be tens to hundreds of terabytes. We plan to use the testbed for 3 years.