Large scale data analytics

Project Details

Project Lead
Yogesh Simmhan 
Project Manager
Yogesh Simmhan 
Project Members
Alok Gautam Kumbhare, Charith Wickramaarachchi, Nam Ma, Hsuan-Yi Chu  
Supporting Experts
Bingjing Zhang  
Institution
University of Southern California, Computer Engineering Division  
Discipline
Electrical and Related Engineering (106) 
Subdiscipline
11.04 Information Sciences and Systems 

Abstract

The pervasive deployment of environmental sensors and instruments that monitor natural and human activities is leading to the generation of large data sets at fine granularities of time and space. eScience and eEngineering applications can shift from an approach based on modeling and empirical testing of hypotheses to one that defines predictive models from current and historical information. Such data analytics applications leverage data mining and machine learning methods to analyze large-scale information in support of research, development and even operations. However, these applications are data and compute intensive, and because the data changes continually they must be rerun often. In this project, we propose to develop and scale machine learning algorithms onto elastic Cloud infrastructure to build predictive models for power forecasting in smart electricity grids. These models can subsequently be used to make realtime predictions of energy usage at campus and city scales for energy conservation and planning. Programming models such as Map-Reduce/Hadoop and DAGs will be used to describe these applications and execute them on public Cloud platforms such as Eucalyptus to evaluate their efficacy for both static and streaming datasets.

Intellectual Merit

Large scale data mining and machine learning are compute and data intensive but are less studied for execution on distributed systems, which limits users to running them on smaller samples that fit on a single machine even though larger datasets are available. Most algorithms that have been ported to the Cloud are inherently loosely coupled, but several commonly used modelling techniques are not loosely coupled in any straightforward way. We will study scalable machine learning models for classification, such as regression trees and artificial neural networks (ANN), that use novel algorithms or mapping techniques for scalable execution on the Cloud.
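
As an illustration of what such a mapping can look like, the sketch below evaluates one candidate split of a regression tree in a Map-Reduce style. It is a minimal single-process example under stated assumptions, not the project's algorithm: the function names (map_partition, merge, sse, split_cost) and the data layout, a list of partitions each holding (feature, target) pairs, are hypothetical. The point is that each partition can be summarised independently by additive statistics (count, sum, sum of squares), so only those small summaries need to be combined.

```python
from functools import reduce


def map_partition(rows, threshold):
    """Map step: summarise one partition as [count, sum, sum of squares]
    of the target, separately for each side of `feature <= threshold`."""
    stats = {"left": [0, 0.0, 0.0], "right": [0, 0.0, 0.0]}
    for feature, target in rows:
        side = stats["left" if feature <= threshold else "right"]
        side[0] += 1
        side[1] += target
        side[2] += target * target
    return stats


def merge(a, b):
    """Reduce step: the statistics are additive, so merge element-wise."""
    return {key: [x + y for x, y in zip(a[key], b[key])] for key in a}


def sse(count, total, total_sq):
    """Sum of squared errors around the mean, from the merged statistics."""
    return total_sq - total * total / count if count else 0.0


def split_cost(partitions, threshold):
    """Total squared error of the two child nodes for one candidate split."""
    merged = reduce(merge, (map_partition(p, threshold) for p in partitions))
    return sse(*merged["left"]) + sse(*merged["right"])
```

A tree builder would call split_cost for each candidate threshold and keep the one with the lowest cost. In a real deployment the map step would run on separate workers (for example as Hadoop map tasks), which this sketch does not attempt.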

Broader Impacts

The results of our work will allow broader and more effective use of data mining for eScience and eEngineering. We will apply the tools and algorithms we develop to the smart power grid domain for energy use forecasting, but the machine learning algorithms themselves will be generally applicable. All our research and development will be publicly available, and the research results will be published in workshops and conferences for access by the broader community.

Scale of Use

We expect to use a few VMs for regular (daily) experiments and hundreds of VMs once a week for scalability testing.