Analyzing Large-Scale Cancer Genomics Sequencing Data with Next-Generation Sequencing (NGS) Data Analysis Tools in a Hybrid Cloud

Project ID
Project Categories
Life Science
Future members of this project and I will use the FutureGrid computing resource to apply NGS data analysis tools within the MapReduce framework of the recently published HugeSeq computational pipeline to analyze large-scale cancer genomics sequencing data* in a hybrid cloud. The goal is to detect and annotate SNPs (single nucleotide polymorphisms), indels (insertions or deletions), SVs (structural variations such as translocations and inversions), and CNVs (copy number variants). This project will use virtual machines (VMs) and hybrid cloud computing to accelerate NGS data analysis, both to keep pace with the speed at which NGS data are generated and to leverage new advances in genomics and personalized medicine (sometimes called “individualized” medicine, or by the preferred term “precision” medicine).
*E.g., exome and whole-genome sequence data publicly available from the Sequence Read Archive, the Cancer Genome Atlas, the new Cancer Genomics Hub at UCSC, the Catalogue of Somatic Mutations in Cancer database, the Pediatric Cancer Genome Project, and the International Cancer Genome Consortium.
Use of FutureSystems
I plan to store small-scale cancer genomics sequence data on FutureGrid, and to create and test a VM there for testing and applying NGS analysis tools within the MapReduce framework of the HugeSeq computational pipeline in a hybrid cloud. I also plan to make the VM available to researchers in the cancer genomics, genomic medicine, and NGS data analysis research communities at large.
Scale of Use
A few VMs will be used to test and process small-scale cancer genomics sequence data. The running time of the VMs will depend on the NGS analysis tools being tested in the HugeSeq MapReduce framework. If possible, I would like long-term use of the service to store small-scale sequence data and to continue exploring, testing, and applying different NGS data analysis tools for the detection and annotation of all types of genetic variation in genomic sequence data.