Exploring map/reduce frameworks for users of traditional HPC

Project ID
Project Categories
Technology Evaluation
Project Keywords
The map/reduce paradigm has become a critical part of keeping pace with the "deluge of data" coming from domain sciences such as genetics, cosmology, and high-energy physics. While data-enabled science is becoming widely accepted as a key component of scientific discovery, an understanding of how to transform terabytes of raw data into useful information is not nearly as widespread. Map/reduce's roots in Java and web-oriented technologies have created a perceptible barrier to entry for users of "traditional" high-performance computing (HPC), whose core competencies include languages such as Fortran and C. Although the map/reduce framework can be generalized to any language, few HPC users have practical experience extending it to the applications and languages with which they are comfortable. As a result, a gap exists between the potential and realized applications of data-intensive computing with map/reduce.

This project therefore aims to develop a practical understanding of existing map/reduce frameworks and methods among the HPC professionals (e.g., XSEDE User Services staff) who provide guidance to the HPC user community, both within XSEDE and on an ad hoc basis. By developing hands-on working knowledge of applying map/reduce methodologies in traditional HPC domains, we hope to provide more practical and useful guidance to traditional HPC users whose research now involves data-intensive computation. This will involve (1) exploring map/reduce in the context of traditional HPC languages (Fortran, C, and Python) via Hadoop streaming and the native language support in MapReduce MPI, (2) evaluating the performance and ease of use of these frameworks for existing scientific problems, (3) developing documentation and boilerplate code for potential users, and (4) establishing tutorials for migrating existing Fortran/C/Python kernels to distributed map/reduce.
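As a rough illustration of the kind of boilerplate described above, the following is a minimal sketch of a Hadoop-streaming-style word count in Python. It is not code from this project. It assumes the usual streaming contract: the mapper and reducer exchange tab-separated key/value text lines, and the framework sorts records by key between the two stages. The function names (mapper, reducer, run_local, main) and the "map"/"reduce" role argument are hypothetical and chosen only for this example.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; records must already be sorted by key."""
    pairs = (rec.rstrip("\n").split("\t", 1) for rec in sorted_records)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


def main(role, stream_in=sys.stdin, stream_out=sys.stdout):
    """Run one stage as a filter; 'role' is a hypothetical 'map'/'reduce' flag."""
    stage = mapper if role == "map" else reducer
    for record in stage(stream_in):
        stream_out.write(record + "\n")
```

The run_local helper allows the logic to be checked on a workstation before any cluster is involved; for example, run_local(["the cat", "The dog"]) groups both spellings of "the" under a single key. In an actual streaming job, a script calling main("map") or main("reduce") would presumably be supplied through the streaming jar's -mapper and -reducer options, and the same stdin/stdout contract is what would let a Fortran or C executable take the place of either stage.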
Use of FutureSystems
FutureGrid is the enabling technology behind this project. Although the project members have access to compute cycles on traditional HPC resources by virtue of being XSEDE users and staff, it is not sensible to compete with the XSEDE user base for those cycles in order to test these frameworks. Maintaining a low barrier to entry is a key goal, and since a rich VM-based map/reduce ecosystem already exists for technologies like Hadoop and Twister, FutureGrid is an ideal platform for rapidly assessing how suitable these technologies are for key domain-specific problems. Should the deployment of "virtual appliances" prove to be an effective way of allowing researchers to adapt map/reduce to typical problems, FutureGrid is also an excellent development testbed for such appliances.

Furthermore, FutureGrid resources provide hardware and software that are relevant to traditional HPC users, such as InfiniBand interconnects and the Torque resource manager. This allows the project to be as realistic and relevant as possible when assessing frameworks (e.g., MapReduce MPI over InfiniBand) and ease of use (e.g., adapting existing Torque scripts to use map/reduce). Finally, FutureGrid was architected with scientific research in mind and already has a broad community and knowledge base that has made tremendous progress in making these cloud-oriented technologies accessible to researchers. This existing support vastly reduces the startup overhead for the specific goals of this project.
Scale of Use
This project is not directly funded, so its resource usage will come in intermittent bursts as members' time allows. However, it is intended to be a long-term, ongoing effort to continually assess new methodologies as they emerge.