Running GitHub Analysis with Spark

Project Details

Project Lead
Pulasthi Wickramasinghe 
Project Manager
Pulasthi Wickramasinghe 
Project Members
Geoffrey Fox, Pik-Mai Hui, Supun Kamburugamuve, Allan Streib, John Bollenbacher, Nathaniel Rodriguez, Diogo Pacheco  
Indiana University, Digital Science Center  
Computer Science (401) 


The project aims to analyze GitHub data using Apache Spark. Spark will be used to achieve high performance and to manage the large volume of data.

Intellectual Merit

This work will help us understand the performance challenges of large-scale analysis of real-world data using the latest big data technology.

Broader Impacts

If completed successfully, this work will help reduce the time taken to run certain queries on large datasets.

Scale of Use

We would need 13 nodes to set up the Apache Spark cluster for this analysis.