SamzaSQL - Scalable Fast Data Management with Streaming SQL

Project Details

Project Lead
Milinda Pathirage 
Project Manager
Milinda Pathirage 
Institution
Indiana University, Data to Insight Center  
Discipline
Computer Science (401) 
Subdiscipline
11.07 Computer Science 

Abstract

As the data-driven economy evolves, enterprises have come to realize a competitive advantage in being able to act on high volume, high-velocity streams of data.  Technologies such as distributed message queues and streaming processing platforms that can scale to handle thousands of data stream partitions on commodity hardware are a response. However, the programming API provided by these systems is often too low-level, requiring substantial custom code that adds to the maintenance overhead. Additionally, these types of systems lack support for the SQL-based querying capabilities that have proven popular on Hadoop-based systems like Hive, Impala or Presto.  We define a minimal set of extensions to standard SQL for fast data querying and manipulation.  These extensions are prototyped in SamzaSQL, a new tool for streaming SQL that compiles streaming SQL into physical plans that are executed on Samza, an open-source distributed stream processing framework. We compare the performance of streaming SQL queries against native Samza applications and discuss usability improvements when compared with native stream processing applications written in Samza's Java API. SamzaSQL is an open-source, Apache project, and available for general use.

Intellectual Merit

We are exploring a new streaming SQL language with minimal extensions to the standard SQL. This allows to utilize existing query planning framework and optimizations from standard SQL in a streaming context. This work also extends popular query planning framework with streaming specific optimizations that can be re-used in a wide variety of stream processing platforms.

Broader Impacts

If the language we proposed is proven viable for modern stream querying capabilities, we are planning to come up with a streaming SQL Test Compatibility Kit for use when implementing the language in different settings. Also, contributions we made to query planning system is currently being utilize by two other projects (Apache Samza and Apache Flink) and we believe that providing re-usable stream query planning framework will have greater impact on reducing time and resources it takes to build new stream management systems.

Scale of Use

I am planning to run a set of experiments involving 1 or 2 node YARN cluster, 3 node Kafka cluster and 3 node Zookeeper cluster. Zookeeper cluster doesn't require that much resources and should be able to run 2 instances in one node.