Investigation of Data Locality and Fairness in MapReduce

Project Details

Project Lead
Zhenhua Guo 
Project Manager
Zhenhua Guo 
Institution
Indiana University, Pervasive Technology Institute  
Discipline
Computer Science (401) 

Abstract

Traditional High-Performance Computing (HPC) environments separate compute and storage resources and adopt a "bring data to compute" strategy. MapReduce is a data-parallel model that uses the same set of nodes for both compute and storage. As a result, data affinity is integrated into the scheduling algorithm to bring compute to data. In data-intensive computing, data locality matters more than before because it can significantly reduce network traffic. In this project, we investigate the data locality of MapReduce in detail: 1) we summarize important system factors and theoretically derive the relationship between those factors and data locality; 2) we analyze the state-of-the-art Hadoop scheduling algorithms to investigate their performance; 3) we propose new scheduling algorithms that yield optimal data locality; 4) we integrate data locality and fairness; 5) we compare our algorithms with the default Hadoop scheduling algorithm.
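To make the notion of data locality concrete, the following is a minimal sketch under a deliberately simplified model: each task reads one input block, each block has replicas on a set of nodes, and each listed node has one idle slot. The function names (greedy_assign, best_assign, locality_ratio) and the example data are hypothetical and chosen only for illustration; this is neither Hadoop's scheduler code nor the algorithms proposed in this project. It shows only that serving slots one at a time can yield a lower ratio of data-local tasks than considering all tasks and slots together.

```python
from itertools import permutations


def locality_ratio(assignment, replicas):
    """Fraction of assigned tasks that run on a node holding their input."""
    if not assignment:
        return 0.0
    local = sum(1 for task, node in assignment.items() if node in replicas[task])
    return local / len(assignment)


def greedy_assign(slots, replicas):
    """Serve idle slots one at a time, preferring a task local to that node.

    Simplified illustration of per-slot scheduling (an assumption, not
    Hadoop's actual implementation).
    """
    unassigned = list(replicas)  # task ids in submission order
    assignment = {}
    for node in slots:
        if not unassigned:
            break
        # Prefer the first pending task with an input replica on this node;
        # otherwise fall back to the first pending task (non-local).
        local = [t for t in unassigned if node in replicas[t]]
        task = local[0] if local else unassigned[0]
        assignment[task] = node
        unassigned.remove(task)
    return assignment


def best_assign(slots, replicas):
    """Exhaustive search over task-to-slot matchings (tiny inputs only)."""
    tasks = list(replicas)
    best, best_ratio = {}, -1.0
    for perm in permutations(slots, min(len(tasks), len(slots))):
        candidate = dict(zip(tasks, perm))
        ratio = locality_ratio(candidate, replicas)
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
    return best


# Hypothetical example: t1 has replicas on A and B, t2 only on A.
slots = ["A", "B"]
replicas = {"t1": {"A", "B"}, "t2": {"A"}}

greedy = greedy_assign(slots, replicas)
best = best_assign(slots, replicas)
print(locality_ratio(greedy, replicas), locality_ratio(best, replicas))
```

In this example the per-slot approach gives node A's slot to t1, which leaves t2 to run non-locally on B, whereas the exhaustive search places t1 on B and t2 on A so that both tasks are data-local. The exhaustive search is used here only because it is easy to verify; it does not scale beyond a handful of tasks.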

Intellectual Merit

This project addresses an important issue in MapReduce: data locality. Our proposed algorithms yield optimal data locality and can dramatically reduce the time spent on data movement. The integration of data locality and fairness allows users to choose the tradeoff that best fits their environments and requirements.

Broader Impacts

In the era of data-intensive computing, data locality is critical because moving extremely large amounts of data during processing is inefficient. This project can help researchers better understand MapReduce data locality in a quantitative way. In addition, this project produces conclusions and results that lay the foundation for further research on data-parallel systems.

Scale of Use

We used 1 to 5 HPC nodes.

Results

Our experimental results show that our proposed algorithms improve data locality and substantially outperform the default Hadoop scheduling. For example, the ratio of data-local tasks is increased by 12% to 14%, and the cost of data movement is reduced by up to 90%.
The detailed results of this project have been presented in two papers: "Investigation of data locality and fairness in MapReduce" [bib]Guo:2012:IDL:2287016.2287022[/bib], and "Investigation of Data Locality in MapReduce" [bib]fg-261-05-2012-a[/bib].