Large-scale agent-based simulation of GitHub

Project ID
FG-536
Project Categories
Computer Science
Completed
Abstract
The project is supported by a DARPA grant. Our team consists of three sub-teams of PIs and Ph.D. students from the University of Southern California, the University of Notre Dame, and Indiana University Bloomington. The goal of the project is to develop a large-scale, detailed agent-based simulation that is capable of modelling and predicting real-world events. We are currently focusing on simulation using data extracted from the GHTorrent project, which contains data records of activity on GitHub. We are seeking a high-performance computing platform suitable for hosting our datasets and running analyses on them. Technologies we would potentially like to use are Spark and Hadoop. We roughly estimate the size of the dataset at 2 TB to 5 TB uncompressed.
Use of FutureSystems
We would like to perform data analysis on the GHTorrent dataset, which is several terabytes in size, such as measuring the distributions of user activity and repository activity.
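As a rough illustration of the kind of measurement meant here, the sketch below counts events per user and per repository and then tabulates how many users or repositories fall at each activity level. The record layout `(user, repo, event_type)`, the sample data, and the function names are assumptions made for illustration only; they are not the GHTorrent schema or the project's actual analysis code.

```python
from collections import Counter

# Hypothetical sample records in the form (user, repo, event_type).
# The real GHTorrent tables and field names are not described in this
# summary, so this layout is an assumption for illustration.
SAMPLE_EVENTS = [
    ("alice", "org/repo-a", "push"),
    ("alice", "org/repo-a", "issue"),
    ("alice", "org/repo-b", "fork"),
    ("bob", "org/repo-a", "push"),
    ("bob", "org/repo-b", "watch"),
    ("carol", "org/repo-a", "pull_request"),
]


def activity_counts(events, key_index):
    """Count events per entity, where the entity is the field at key_index."""
    return Counter(event[key_index] for event in events)


def activity_distribution(counts):
    """Map each activity level to the number of entities at that level."""
    return dict(sorted(Counter(counts.values()).items()))


user_counts = activity_counts(SAMPLE_EVENTS, 0)   # events per user
repo_counts = activity_counts(SAMPLE_EVENTS, 1)   # events per repository

user_dist = activity_distribution(user_counts)
repo_dist = activity_distribution(repo_counts)

print("events per user:", dict(user_counts))
print("user activity distribution:", user_dist)
print("events per repo:", dict(repo_counts))
print("repo activity distribution:", repo_dist)
```

On a dataset of several terabytes, an in-memory count like this would not be practical; the same group-and-count pattern would instead be expressed as a distributed job in a framework such as Spark or Hadoop.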
Scale of Use
We would like enough vcores to run our measurements at a reasonable speed, and enough disk space to host our dataset locally. The project will likely last for at least two semesters, and possibly longer.