Seismic Data Processing Platform Based on MapReduce and NoSQL

Project Details

Project Lead
Tony Liu 
Project Manager
Tony Liu 
Project Members
Yinzhi Wang  
Supporting Experts
Hyungro Lee  
Institution
Indiana University Bloomington, Computer Science Department  
Discipline
Geosciences (302) 
Subdiscipline
11.07 Computer Science 

Abstract

We propose to develop a novel framework for seismology data processing founded on two relatively new but stable technologies: a NoSQL database system called MongoDB, and a scalable parallel processing framework called Hadoop. The processing system we propose will load core metadata from the IRIS DMC into MongoDB to allow a seismologist to build a well-managed, working dataset for input into the system. We will use Hadoop to provide a scalable processing system. That is, Hadoop provides a mechanism to process the same dataset on systems ranging from a single multicore desktop to a state-of-the-art cluster with thousands of nodes. The Hadoop framework abstracts the processing flow into a concept called MapReduce. MapReduce vastly simplifies parallelization because most standard serial processing algorithms can be adapted by writing Map and Reduce wrapper procedures for the algorithm; Hadoop then handles scheduling and the flow of data through the system. We will use MongoDB to manage data created at all phases of the processing system. MongoDB is a document database that will allow us to manage four additional types of data that are traditionally treated very differently: (1) process-generated metadata, (2) processed waveform data, (3) processing parameters used as input to individual algorithms, and (4) log output from individual processing algorithms. Integration of MongoDB and Hadoop is an established technology. The Pig Latin scripting language will be used to build processing workflows. Construction of the framework itself is largely plug-and-play, with initial development work needed to define an efficient MongoDB schema and to implement MapReduce functions for the suite of algorithms we will need for testing. Research will center on four primary questions: (1) How can we most efficiently extract, transform, and load data into MongoDB? (2) What is the I/O performance of the framework under different configurations? (3) What are the tradeoffs in processing time for different algorithms at different scales? (4) How scalable is this system for seismic processing?
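
To illustrate the Map/Reduce wrapping idea described above, the sketch below shows how a standard serial algorithm (here an ObsPy bandpass filter, chosen only as an example) could be wrapped as a Hadoop Streaming-style mapper and reducer in Python, with processed waveforms, parameters, metadata, and log output written back to MongoDB. This is a minimal sketch, not the project's actual design: the database name "seismic", the collections "wf" and "wf_processed", and the document fields "miniseed" and "sta" are illustrative assumptions, not the schema we will develop.

    #!/usr/bin/env python
    # Minimal sketch: Hadoop Streaming-style map/reduce wrapper around a serial
    # ObsPy algorithm, with results stored in MongoDB. All collection and field
    # names below are hypothetical placeholders, not the project's schema.
    import io
    import sys

    from pymongo import MongoClient
    from obspy import read

    client = MongoClient("localhost", 27017)   # assumed MongoDB endpoint
    db = client["seismic"]                     # hypothetical database name


    def mapper():
        # Map step: each input line is assumed to be one raw-waveform document id.
        for line in sys.stdin:
            doc_id = line.strip()
            if not doc_id:
                continue
            doc = db.wf.find_one({"_id": doc_id})              # raw waveform doc (assumed)
            st = read(io.BytesIO(doc["miniseed"]), format="MSEED")
            st.detrend("demean")
            st.filter("bandpass", freqmin=0.01, freqmax=1.0)   # example serial algorithm
            buf = io.BytesIO()
            st.write(buf, format="MSEED")
            db.wf_processed.insert_one({
                "parent_id": doc_id,                           # process-generated metadata
                "algorithm": "bandpass",
                "params": {"freqmin": 0.01, "freqmax": 1.0},   # processing parameters
                "data": buf.getvalue(),                        # processed waveform
                "log": "ok",                                   # per-algorithm log output
            })
            # Emit station -> 1 so the reducer can tally completed traces per station.
            print("%s\t1" % doc["sta"])


    def reducer():
        # Reduce step: sum counts per station key (Streaming input arrives sorted by key).
        current, total = None, 0
        for line in sys.stdin:
            key, val = line.rstrip("\n").split("\t")
            if key != current:
                if current is not None:
                    print("%s\t%d" % (current, total))
                current, total = key, 0
            total += int(val)
        if current is not None:
            print("%s\t%d" % (current, total))


    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

A script of this form could be launched through Hadoop Streaming by passing it as the -mapper and -reducer executables with the appropriate mode argument; in the proposed system the end-to-end workflows themselves would be composed in Pig Latin as described above.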

Intellectual Merit

The science of seismology is a victim of its own success. Vast increases in the volume of data available to the community, made possible through IRIS and Earthscope, have revealed fundamental flaws in the community's infrastructure for processing these data. All standard tools used by the community are at least 20 years old, and most are based on archaic computing concepts. We assert this is strongly limiting progress in the field. The system we propose could fundamentally change the way data are handled in our community and open the door to whole new realms of research using this vast new pool of data. It can also promote more reproducible science by encapsulating a workflow in a single end-to-end system.

Broader Impacts

Although the focus of this project is seismology, the processing model is generic and has demonstrated success in other fields such as bioinformatics. Broad areas of the geosciences could potentially apply the same approach to their disciplines; the differences lie only in the database design and in the algorithms to be adapted to Hadoop. As a concrete step in that direction, we propose a short course in association with the 2017 Earthscope National Meeting to introduce our concepts to a broader community. The project promotes cross-disciplinary research by bringing together a PI and students from geophysics and computer science. It will provide support for a graduate student in each field to promote this goal.

Scale of Use

We want to test how scalable this system is for seismic processing, so we will need to scale the number of VMs used in our experiments.