High Performance Data Analytics with DSC-SPIDAL
Project Details
- Project Lead
- Saliya Ekanayake
- Project Manager
- Saliya Ekanayake
- Project Members
- Geoffrey Fox, Pulasthi Wickramasinghe, Md Enayat Ullah, Madhuri Upadrasta, Kannan Govindarajan, Matthew Klosak, Jackie Han
- Institution
- Virginia Tech, Network Dynamics and Simulation Science Laboratory (NDSSL)
- Discipline
- Computer Science (401)
Abstract
Scalable Parallel Interoperable Data Analytics (SPIDAL) is a novel library being developed by the Digital Science Center (DSC) at Indiana University. The library is available on GitHub [1]. In its current version, the library provides high-performance multidimensional scaling, pairwise clustering, and vector clustering algorithms. Further, we provide a data visualization tool that enables scientists to visually evaluate results in 3D and draw conclusions from them. We have successfully analysed life science data with these tools in the past [2,3] and plan to extend the analysis to larger datasets with this project.
[1] https://github.com/orgs/DSC-SPIDAL
[2] Yang Ruan, Geoffrey L. House, Saliya Ekanayake, Ursel Schütte, James D. Bever, Haixu Tang, and Geoffrey Fox. "Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions." First International Workshop on Cloud for Bio (C4Bio 2014), held as part of CCGrid 2014, the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Chicago, Illinois, May 27-29, 2014.
[3] Yang Ruan, Saliya Ekanayake, Mina Rho, Haixu Tang, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. "DACIDR: Deterministic Annealed Clustering with Interpolative Dimension Reduction using a Large Collection of 16S rRNA Sequences." ACM Conference on Bioinformatics, Computational Biology and Biomedicine (ACM BCB), Orlando, Florida, October 7-10, 2012.
Intellectual Merit
The performance challenges of global-optimization-style algorithms are not well studied in the community. This experiment will let us critically evaluate the performance of these algorithms and their software stacks while analysing real data for our collaborators.
Broader Impacts
If we complete this successfully, it will enable biologists and other collaborators to cluster and visualize large amounts of data. Further, it will serve as a benchmark for the algorithms tested and, more generally, for other algorithms in the same class.
Scale of Use
We would need around 32 nodes, allocated for a couple of days at a time over a couple of months.