Streaming in the Clouds

Project Details

Project Lead
Radu Tudoran 
Project Manager
Pierre Riteau 
Project Members
Pierre Riteau, Radu Tudoran, Sergey Panitkin  
Computer Science (401) 


In recent years, BigData has become an important aspect of scientific discovery, a shift referred to as the Fourth Paradigm. Across the wide spectrum of applications and acquisition methods, the ones that will generate the largest amounts of data fall into the category of streaming data, e.g., networks of sensors, observatories, telescopes, or experiments such as the CERN LHC. As the amount of acquired information grows and data sources become increasingly geographically distributed, it becomes important to process the data in scalable and efficient ways. Cloud computing is an interesting option for a scalable processing platform. However, the question arises of how best to use cloud computing capabilities for geographically distributed stream processing. In this work, we explore and analyze different approaches to streaming data to the cloud and evaluate them in the context of multiple cloud offerings, including Microsoft Azure and FutureGrid's Nimbus and OpenStack installations. We show, using an ATLAS application, that choosing the right approach to streaming data can improve average data rates by a factor of three.

Intellectual Merit

The goal of the project is to understand how streaming is supported by cloud environments. This is a key question going forward, as BigData is expected to take the form of stream data in the future.

Broader Impacts

The results and observations can be used by any scientific researchers who will have to analyze such data (i.e., stream data). These findings will allow them to scale and adjust their experiment configurations so that they can process all the data they need.

Scale of Use

The number of VMs used is on the order of tens, up to a hundred. As the goal is to understand how BigData streaming is supported at large scale, scalability in terms of the number of nodes/VMs is important.