Wide area distributed file system for MapReduce applications on FutureGrid platform

Project Details

Project Lead
Lizhe Wang 
Project Manager
Lizhe Wang 
Institution
Indiana University, Pervasive Technology Institute  
Discipline
Computer Science (401) 

Abstract

Map/Reduce is a software framework introduced by Google for processing large datasets in parallel across a large number of compute nodes. Map/Reduce has received significant attention as a programming model for various scientific problems on a wide variety of computing architectures, including large-scale compute clusters, GPGPUs, and multi-core processors. However, some data-intensive computing applications, such as those in the Large Hadron Collider (LHC) computing project, earth observation sciences, and biomedical research, require large-scale distributed data processing across multiple data centers connected via high-speed networks. In this project, we aim to develop a software framework for Map/Reduce across distributed data centers in support of such data-intensive applications.
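The programming model described above can be sketched in plain Python (the function names here are illustrative, not the Hadoop API): a map function emits key/value pairs, a shuffle step groups them by key, and a reduce function folds each group, with the reduce calls independent and therefore distributable across nodes.

```python
from collections import defaultdict
from itertools import chain

def map_fn(document):
    """Map phase: emit a (word, 1) pair for every word in a document."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce phase: sum the partial counts for one word."""
    return (word, sum(counts))

def map_reduce(documents, map_fn, reduce_fn):
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(d) for d in documents):
        groups[key].append(value)
    # Reduce each key group independently (parallelizable across nodes).
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

result = map_reduce(["the quick fox", "the lazy dog the"], map_fn, reduce_fn)
# result maps each word to its total count across both documents
```

In a real deployment the shuffle is the expensive step, since intermediate data must move between nodes; across wide-area data centers this transfer dominates, which is the motivation for the file system proposed here.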

Intellectual Merit

We will develop a Wide area Distributed File System, a high-performance large-scale file system that integrates existing data management toolkits. The Cyberaide Farm is not expected to provide a complete set of file system functionalities; however, it will deliver high-performance data transfer and replication management across distributed data centers. We will also develop a Cyberaide Farm Runtime System, which extends the Apache Hadoop framework and provides a set of APIs for the developer and user communities.
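One way a wide-area replication manager might pick a placement target can be sketched as follows; all names here (sites, costs, the `place_replica` helper) are illustrative assumptions, not the actual Cyberaide Farm API:

```python
# Illustrative sketch: choose where to place the next replica of a block
# across wide-area data centers, preferring sites that do not already
# hold the block and whose WAN link is cheapest.

def place_replica(block_id, sites, current_replicas, link_cost):
    """Return the site for the next replica of block_id, or None.

    sites            -- list of data-center names
    current_replicas -- dict: block_id -> set of sites holding it
    link_cost        -- dict: site -> relative WAN transfer cost
    """
    held = current_replicas.get(block_id, set())
    candidates = [s for s in sites if s not in held]
    if not candidates:
        return None  # block is already replicated everywhere
    # Among sites lacking the block, take the cheapest WAN link.
    return min(candidates, key=lambda s: link_cost[s])

sites = ["site_a", "site_b", "site_c"]
replicas = {"block1": {"site_a"}}
costs = {"site_a": 0, "site_b": 5, "site_c": 3}
target = place_replica("block1", sites, replicas, costs)
# target is the cheapest site not yet holding block1
```

A production policy would also weigh replica count targets, site capacity, and observed bandwidth rather than a static cost table; the point here is only that wide-area placement is a distinct decision from intra-cluster placement in stock Hadoop.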

Broader Impacts

Our work will be implemented and tested on distributed production data centers powered by the FutureGrid project. We will use standard data-intensive computing benchmarks, high-performance engineering applications, and data-intensive social computing applications to evaluate our implementation.

Scale of Use

Distributed Hadoop clusters spanning a wide area network; each cluster has at least 8 compute nodes.