Analyzing Large-scale Cancer Genomics Sequencing Data with Next Generation Sequencing (NGS) Data Analysis Tools in Hybrid Cloud

Project Details

Project Lead
Linda McMahan 
Project Manager
Linda McMahan 
Supporting Experts
Yang Ruan  
Institution
Memorial Sloan-Kettering Cancer Center, Human Oncology & Pathogenesis  
Discipline
Biology (603) 

Abstract

Future members of this project and I will use the FutureGrid computing resource to run NGS data analysis tools within the MapReduce framework of the recently published HugeSeq computational pipeline (http://www.nature.com/nbt/journal/v30/n3/full/nbt.2134.html). The goal is to analyze large-scale cancer genomics sequencing data* in a hybrid cloud in order to detect and annotate SNPs (single nucleotide polymorphisms), Indels (insertions or deletions), SVs (structural variations such as translocations and inversions) and CNVs (copy number variants). This project will use Virtual Machines (VMs) and hybrid cloud computing to accelerate NGS data analyses, both to keep pace with the speed at which NGS data are generated and to leverage new advances in genomics and personalized (sometimes called “individualized” or, preferably, “precision”) medicine.
 
*E.g., exome and whole-genome sequence data publicly available from the Sequence Read Archive (http://blogs.nature.com/news/2011/10/sequencing_data_archive_resurf.html), the Cancer Genome Atlas (http://cancergenome.nih.gov/), the new Cancer Genomics Hub at UCSC (https://cghub.ucsc.edu/), the Catalogue of Somatic Mutations in Cancer database (http://www.sanger.ac.uk/resources/databases/cosmic.html), the Pediatric Cancer Genome Project (http://www.stjude.org/whole-genome-data), and the International Cancer Genome Consortium (http://www.sanger.ac.uk/research/areas/humangenetics/icgc.html).

Intellectual Merit

This project will use the resources provided by FutureGrid to explore, test and analyze different NGS data analysis tools for detecting and annotating all types of genetic variation (e.g. SNPs, Indels, SVs and CNVs) in large-scale cancer genomics sequencing data, using the MapReduce framework of the HugeSeq computational pipeline. This project will also exploit Virtual Machines and hybrid cloud computing to develop a portable, stand-alone, hybrid-cloud-enabled VM software package (bundled with the aforementioned NGS tools in the HugeSeq MapReduce framework) that lets researchers in genomic medicine run computationally intensive NGS data analyses easily in a hybrid cloud. A hybrid cloud keeps sensitive data in the private cloud while providing the scalability, computational resources and cost-effectiveness of the public cloud, which is especially valuable to researchers without extensive bioinformatics resources.

Broader Impacts

The proposed portable hybrid-cloud-enabled VM package developed in this project will be available (1) for download as open source software through SourceForge, (2) to researchers in medical genomics and the NGS data analysis community at large, for analyzing large-scale sequence data in a hybrid cloud to detect and annotate all types of genetic variation (SNPs, Indels, SVs and CNVs) in genomic sequences, and (3) for education, training and outreach on NGS data analysis in the cloud through online tutorials, online classes freely accessible worldwide through Coursera (https://www.coursera.org/), webinars, and workshops.

Scale of Use

A few VMs will be used to test and process small-scale cancer genomics sequence data. The running time of the VMs will depend on the NGS analysis tools being tested in the HugeSeq MapReduce framework. If possible, I would like long-term use of the service to store small-scale sequence data and to continue exploring, testing and analyzing different NGS data analysis tools for detecting and annotating all types of genetic variation in genomic sequence data.