High Performance Data Analytics with DSC-SPIDAL

Project Details

Project Lead
Saliya Ekanayake 
Project Manager
Saliya Ekanayake 
Project Members
Geoffrey Fox, Pulasthi Wickramasinghe, Md Enayat Ullah, Madhuri Upadrasta, Kannan Govindarajan, Matthew Klosak, Jackie Han
Institution
Virginia Tech, Network Dynamics and Simulation Science Laboratory (NDSSL)  
Discipline
Computer Science (401) 

Abstract

Scalable Parallel Interoperable Data Analytics (SPIDAL) is a novel library being developed by the Digital Science Center (DSC) at Indiana University. The library is available on GitHub at [1]. In its current version, the library provides high-performance multidimensional scaling, pairwise clustering, and vector clustering algorithms. We also provide a data visualization tool that enables scientists to evaluate results visually and draw conclusions from them in 3D. We have successfully analysed life science data with these tools in the past [2,3] and plan to extend the analysis to larger datasets in this project.

[1] https://github.com/orgs/DSC-SPIDAL
[2] Yang Ruan, Geoffrey L. House, Saliya Ekanayake, Ursel Schütte, James D. Bever, Haixu Tang, and Geoffrey Fox. Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions. First International Workshop on Cloud for Bio (C4Bio 2014), held as part of CCGrid2014, the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Chicago, Illinois, May 27-29, 2014.
[3] Yang Ruan, Saliya Ekanayake, Mina Rho, Haixu Tang, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. DACIDR: Deterministic Annealed Clustering with Interpolative Dimension Reduction using a Large Collection of 16S rRNA Sequences. ACM Conference on Bioinformatics, Computational Biology and Biomedicine (ACM BCB), Orlando, Florida, October 7-10, 2012.

Intellectual Merit

The performance challenges of algorithms that seek globally optimal solutions are not well studied in the community. This experiment will let us critically evaluate the performance of these algorithms and their software stacks while analysing real data for our collaborators.

Broader Impacts

If completed successfully, this project will enable biologists and other collaborators to cluster and visualize large amounts of data. It will also serve as a benchmark for the algorithms tested and, more generally, for other algorithms in the same class.

Scale of Use

We would need around 32 nodes, allocated for a couple of days at a time over a couple of months.