Performance Evaluation of Data Intensive Scientific Applications

Project Details

Project Lead
Sriram Krishnan 
Project Manager
Sriram Krishnan 
Institution
University of California, San Diego, San Diego Supercomputer Center  
Discipline
Computer Science (401) 
Subdiscipline
Geosciences (302) 

Abstract

We would like to perform a detailed benchmarking effort for data-intensive applications using the resources provided by FutureGrid. While synergistic with our other research funded by the NSF Cluster Exploratory (CluE) and SDSC's Triton Resource Opportunity (TRO) program, this work will showcase the use of supercomputing facilities for large-scale data processing, complementing work that we will perform on other private and public grid- and cloud-based environments. Under the scope of this project, we wish to install our benchmark data sets and execute a number of benchmark queries on them, using both "traditional" and Hadoop-based solutions for serving these data sets. The application domains that we are interested in span multiple scientific disciplines, especially the geosciences and bioinformatics. In the geosciences, we are interested in benchmarking the performance of database- and Hadoop-based implementations for serving high-resolution geospatial data sets. In bioinformatics, we are interested in studying the performance of various traditional and cloud-enabled codes for next-generation sequencing.
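The benchmark queries described above would be timed under a common harness so that "traditional" and Hadoop-based servings of the same data set can be compared on equal terms. As a minimal sketch (the harness and the toy workload below are illustrative assumptions, not the project's actual benchmark suite), one might wrap each backend's query in a callable and collect repeated wall-clock timings:

```python
import statistics
import time


def benchmark(query_fn, runs=5):
    """Run query_fn several times and return (mean, stdev) wall-clock
    seconds. In the real study, query_fn would wrap a database query or
    a Hadoop job submission rather than the toy workload below."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        query_fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)


def toy_range_query():
    # Hypothetical stand-in for a geospatial range query:
    # filter synthetic points falling inside a bounding box.
    points = [(x * 0.001, x * 0.002) for x in range(100_000)]
    return [p for p in points
            if 10.0 <= p[0] <= 50.0 and 20.0 <= p[1] <= 90.0]


if __name__ == "__main__":
    mean_s, stdev_s = benchmark(toy_range_query)
    print(f"mean {mean_s:.4f}s, stdev {stdev_s:.4f}s")
```

Reporting both the mean and the spread across runs matters on shared resources such as FutureGrid, where contention can make a single timing misleading.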

Intellectual Merit

The intellectual merit of this research lies in its contribution toward understanding the performance tradeoffs and feasibility of the shared-nothing, Hadoop-style programming model for large-scale, data-intensive scientific computing, and in comparing it with traditional HPC-style approaches. The results from this study will provide a basis for developing data-intensive scientific codes that leverage grid and cloud resources optimally.

Broader Impacts

The broader impact of this study is a reassessment of how data-intensive scientific applications are implemented, and of how data sets are hosted and served to a broad community. A direct impact is the development of scientific codes in the geosciences and bioinformatics that are optimized to leverage the capabilities of grid and cloud resources and their associated programming models.

Scale of Use

We anticipate needing around 50,000 hours over the course of the next year.