High Performance Spatial Data Warehouse over MapReduce

Project Details

Project Lead
Fusheng Wang 
Project Manager
Fusheng Wang 
Project Members
Ablimit Aji, Sanjay Agravat, RUBAO LI, Qiaoling Liu, Yin Huai  
Supporting Experts
Yang Ruan, Yuduo Zhou, Zhenhua Guo  
Institution
Emory University, Center for Comprehensive Informatics  
Discipline
Pathology (613) 
Subdiscipline
11.07 Computer Science 

Abstract

Querying and analyzing large volumes of spatially oriented scientific data becomes increasingly important for many applications. For example, analyzing high-resolution digital pathology images through computer algorithms provides rich spatially derived information of micro-anatomic objects of human tissues. The spatial oriented information and queries at both cellular and sub-cellular scales share common characteristics of "Geographic Information System (GIS)'', and provide an effective vehicle to support computer aided biomedical research and clinical diagnosis through digital pathology. The scale of data could reach a million derived spatial objects and hundred million features for a single image. Managing and querying such spatially derived data to support complex queries such as image-wise spatial cross-matching queries poses two major challenges:  the high complexity of geometric computation and the "big data'' challenge. In this paper, we present a system Hadoop-GIS to support high performance declarative spatial queries with MapReduce. Hadoop-GIS provides an efficient real-time spatial query engine  RESQUE with dynamically built indices to support on the fly spatial query processing. To support high performance queries with cost effective architecture, we develop a MapReduce based framework for data partitioning and staging,  parallel processing of spatial queries with RESQUE, and feature queries with Hive, running on commodity clusters. To provide a declarative query language and unified interface, we integrate spatial query processing into Hive to build an integrated query system. Hadoop-GIS demonstrates highly scalable performance to support our query cases.

Intellectual Merit

Hadoop-GIS is a unique system that integrates DBMS technologies into MapReduce technologies to provide a hybrid architecture to support high performance complex spatial queries with expressive query languages. This also drives the research direction on marrying DBMS and MapReduce technologies. Hadoop-GIS takes a GIS-like approach by mapping pathology imaging problems into GIS problems, a highly innovative approach in the field. The approach provides significant new insights for biomedical research with digital pathology, and creates many new opportunities.

Broader Impacts

Hadoop-GIS is a system for interdisciplinary biomedical research, which accelerates the diagnosis and understanding of brain tumor for better cure. The technologies also help to train computer science researchers and students to learn the state-of-the-art of MapReduce technologies, and the system is used for an advanced graduate course for lectures and projects (https://web.cci.emory.edu/confluence/display/CS730R/). The technologies could also be generalized for similar problems in different domains, such as cross-matching problems in digital sky survey projects, and oil exploration problems.

Scale of Use

Data staging for Hadoop file system, many cores/VMs (thousands) to run Hadoop jobs for a short period (expected to be several hours each time), and run it every couple of weeks.

Results

1. We have provided scalability testing on the futuregrid platform with 320 cores based on Hadoop. A technique report has been published: 
http://confluence.cci.emory.edu:8090/confluence/download/attachments/4033450/CCI-TR-2012-5.pdf?version=1&modificationDate=1341865395000
We are also working on two papers based on the project.

2. We have developed an open source system Hadoop-GIS by extending Apache Hive project with spatial querying capabilities. The URL is:
http://web.cci.emory.edu/confluence/display/HadoopGIS