Running GitHub Analysis with Spark
Project Details
- Project Lead
- Pulasthi Wickramasinghe
- Project Manager
- Pulasthi Wickramasinghe
- Project Members
- Geoffrey Fox, Pik-Mai Hui, Supun Kamburugamuve, Allan Streib, John Bollenbacher, Nathaniel Rodriguez, Diogo Pacheco
- Institution
- Indiana University, Digital Science Center
- Discipline
- Computer Science (401)
Abstract
The project aims to analyze GitHub data using Apache Spark. Spark will be used to achieve high performance and to manage the large volume of data.
Intellectual Merit
This work will help us understand the performance challenges of large-scale analysis of real-world data using the latest big data technology.
Broader Impacts
If completed successfully, this project will help reduce the time taken to run certain queries on large datasets.
Scale of Use
We would need 13 nodes to set up the Apache Spark cluster for this analysis.