Investigation of Data Locality and Fairness in MapReduce

Project ID
FG-261
Project Categories
Computer Science
Completed
Abstract
Traditional High-Performance Computing (HPC) environments separate compute and storage resources and adopt "bring data to compute" strategy. MapReduce is a data parallel model that makes use the same set of nodes for both compute and storage. As a result, data affinity is integrated into the scheduling algorithm to bring compute to data. In data-intensive computing, data locality becomes more important than before because it can potentially reduce network traffic significantly. In this project, we try to investigate the data locality of MapReduce in detail, and do the following things: 1) we summarize important system factors and theoretically deduce the relationship between those factors and data locality; 2) we analyze the state-of-the-art Hadoop scheduling algorithms to investigate their performance; 3) we propose new scheduling algorithms that yield optimal data locality; 4) we integrate data locality and fairness; 5) we compare our algorithms with the default Hadoop scheduling algorithm.

Use of FutureSystems
We ran extensive simulation experiments on FutureGrid bare metal machines.
Scale of Use
We used 1 - 5 of HPC nodes.