Exploring map/reduce frameworks for users of traditional HPC

Project Details

Project Lead
Glenn K. Lockwood 
Project Manager
Glenn K. Lockwood 
Institution
University of California San Diego, San Diego Supercomputer Center  
Discipline
Computer Science (401) 

Abstract

The map/reduce paradigm has become a critical part of keeping pace with the "deluge of data" coming from data sources in domain sciences such as genetics, cosmology, and high-energy physics. While data-enabled science is becoming widely accepted as a key component of scientific discovery, an understanding of exactly how to transform terabytes of raw data into useful information is not nearly as widespread. Map/reduce's roots in Java and web-oriented technologies have created a perceptible barrier to entry for users of "traditional" high-performance computing (HPC) whose core competencies include languages such as Fortran and C. Although map/reduce can be generalized to any language, practical knowledge of how to extend it to the applications and languages with which HPC users are comfortable is not widespread. Thus, a gap exists between the potential and realized applications of data-intensive computing with map/reduce. As such, this project aims to develop a practical understanding of existing map/reduce frameworks and methods among the HPC professionals (e.g., XSEDE User Services staff) who provide guidance to the HPC user community within XSEDE and on an ad hoc basis. By developing this hands-on working knowledge of applying map/reduce methodologies in traditional HPC domains, we hope to provide more practical and useful guidance to traditional users of HPC whose research now involves data-intensive computation. This will involve (1) exploring map/reduce in the context of traditional HPC languages (Fortran, C, and Python) via Hadoop Streaming and MapReduce-MPI's native support, (2) evaluating the performance and ease of use of these frameworks for existing scientific problems, (3) developing documentation and boilerplate code for potential users, and (4) establishing tutorials for migrating existing Fortran/C/Python kernels to distributed map/reduce.
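
The property that makes this generalization possible is that Hadoop Streaming communicates with mappers and reducers entirely through stdin and stdout, so any executable can participate, whether a Python script or a compiled Fortran or C binary. As a minimal sketch of the pattern (the script names and the word-count task here are illustrative, not project deliverables):

    #!/usr/bin/env python
    # mapper.py - read raw text on stdin, emit one "word<TAB>1" pair per word.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py - Hadoop sorts mapper output by key before the reduce step,
    # so identical words arrive on consecutive lines; sum each run of counts.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Such scripts are launched through the standard hadoop-streaming JAR by pointing its -mapper and -reducer options at them; because the contract is nothing more than lines on stdin and stdout, the same pattern carries over to compiled languages with no Java involved.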

Intellectual Merit

This project aims to evaluate the feasibility, applicability, and ease of applying the map/reduce methodology and existing frameworks to problems specifically in domain sciences. The significance of this work, though, lies in exploring these areas from the perspective of a traditional HPC user rather than that of a cloud-services provider or the developer of a specific map/reduce product. A model application for this evaluation will be the digestion of gene sequence variations from Variant Call Format (VCF) files for ease of input into a genomics database application. This simple yet common task is representative of data reduction, a widespread bottleneck in traditional scientific analyses that will only worsen as the level of parallelism in simulations increases.
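
To make the model application concrete, the map step over VCF records could be as simple as the sketch below. It relies only on the standard fixed VCF columns (CHROM, POS, ID, REF, ALT); the output schema and the choice to key on chromosome are illustrative assumptions, not the project's settled design.

    #!/usr/bin/env python
    # vcf_mapper.py - sketch of a Hadoop Streaming map step over VCF input.
    import sys

    for line in sys.stdin:
        if line.startswith("#"):           # skip VCF header and metadata lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:                # ignore malformed records
            continue
        chrom, pos, var_id, ref, alt = fields[:5]
        # Key on chromosome so the shuffle groups variants for a
        # per-chromosome reduce (e.g., counting, or batching rows for
        # ingest into the genomics database).
        print("%s\t%s,%s,%s,%s" % (chrom, pos, var_id, ref, alt))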

Broader Impacts

The goal of this project is inherently broad in that its purpose is to develop an understanding of how the map/reduce methodology can be used by computational scientists of all domains with as low an entry barrier as possible. By equipping the professionals who work directly with users of high-performance computers (XSEDE User Services staff) with an understanding of the strengths of map/reduce frameworks and how to apply them using the native languages of user applications, this project will narrow the gap between traditional HPC and these emerging data-intensive techniques. This knowledge will be made public as short written tutorials, performance comparisons, and whitepapers published online in website, wiki, and blog form.

Scale of Use

This project is not directly funded and, as such, proceeds in intermittent bursts as members' time allows. However, it is intended to be a long-term, ongoing effort to continually assess new methodologies as they emerge.

Results

  1. Guide to Running Hadoop Clusters on Traditional HPC
  2. Guide to Writing Hadoop Jobs in Python with Hadoop Streaming
  3. Guide to Parsing Variant Call Format (VCF) Files with Hadoop Streaming

Source code for these guides is also available on GitHub, and work is ongoing. As a taste of where the third guide leads, a reduce step pairing with the VCF mapper sketched earlier follows.
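
Because Hadoop delivers reducer input sorted by key, this reduce step only has to detect key changes to tally variants per chromosome. As before, this is a minimal illustration rather than the guide's verbatim code:

    #!/usr/bin/env python
    # vcf_reducer.py - sketch of a reduce step counting variants per
    # chromosome; records for one chromosome arrive contiguously.
    import sys

    current_chrom, count = None, 0
    for line in sys.stdin:
        chrom, _ = line.rstrip("\n").split("\t", 1)
        if chrom == current_chrom:
            count += 1
        else:
            if current_chrom is not None:
                print("%s\t%d" % (current_chrom, count))
            current_chrom, count = chrom, 1
    if current_chrom is not None:
        print("%s\t%d" % (current_chrom, count))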