CloVR - Cloud Virtual Resource for Automated Sequence Analysis From Your Desktop

Project Details

Project Lead
Samuel Angiuoli 
Project Manager
Samuel Angiuoli 
Project Members
Saliya Ekanayake  
Institution
University of Maryland, Institute for Genome Sciences  
Discipline
Biology (603) 

Abstract

Background: Second-generation sequencing platforms (e.g. 454, Illumina, SOLiD) have recently made genomic tools affordable and increased their popularity in the broader research community. However, demands on computational resources and the lack of standardized analysis tools increasingly represent a bottleneck in the bioinformatics analysis of large-scale sequence data.
Results: Here, we present the Cloud Virtual Resource (CloVR) software package, which takes advantage of two technologies, Virtual Machines and Cloud computing, to provide a new community resource for sequence analysis suitable for large-scale sequencing projects. CloVR is available as an Open Source virtual machine at http://clovr.org and bundles pre-installed and pre-configured bioinformatics tools into automated pipelines. With the CloVR virtual machine, users can run supported pipelines on their local computers or use scalable on-demand Cloud computing services to perform CPU-intensive tasks over the Internet, without having to install additional software. To support a wide variety of sequencing projects, the CloVR virtual machine is composed of separate sequence analysis tracks. Each track within CloVR comprises the entire suite of Open Source software tools necessary to support a fully automated analysis as required in a typical genomics project. Currently supported applications include BLAST search (CloVR-Search), single microbial whole-genome shotgun (WGS) assembly and annotation (CloVR-Microbe), metagenomic WGS assembly, gene prediction and BLAST comparison (CloVR-Metagenomics), and 16S phylogeny (CloVR-16S). CloVR currently supports VMware for local execution, as well as the commercial Amazon EC2 Cloud (http://aws.amazon.com/ec2/) and the free academic Nimbus Science Clouds (http://www.scienceclouds.org/).
Conclusion: CloVR is a genomics tool that enables any researcher with a sequencing machine and an Internet connection to perform complex and computationally demanding sequence analysis and join the genomic revolution.

Intellectual Merit

Less than 15 years after the first complete sequencing of a bacterial genome, sequence analysis has become an integral part of nearly all research areas in biology. Sequencing expenses have recently dropped sharply thanks to affordable second-generation sequencing technology, leading to the establishment of an increasing number of small genome sequencing facilities. Despite the increased rate of sequence generation, there has not been a commensurate increase in access to the computational resources needed for high-quality sequence processing and analysis. In particular, some investigators new to the field who are now obtaining next-generation sequencing platforms could be insufficiently prepared to take full advantage of their own high-throughput sequencing devices. This proposal intends to close this technological gap by making state-of-the-art sequence analysis software more accessible to researchers without extensive bioinformatics resources. The proposal describes the development of a portable, stand-alone software package, using Virtual Machines (VM), that will incorporate readily available, open source tools for genome analysis. The VM design will provide two main advantages, allowing users to 1) circumvent complex software installations and 2) avoid performance bottlenecks of local computing networks. First, the VM package will include fully operational bioinformatics pipelines within a single executable file that is compatible with all computer operating systems and makes further software installations unnecessary. Second, because the VM is portable, the processing of large sequence data sets can be outsourced to large distributed computing networks called compute clouds.
The analysis protocols provided on the VM package will replicate and extend established bioinformatics protocols and include tools for whole genome and metagenome annotation and comparative analysis, including sequence assembly, gene prediction, functional annotation, metabolic pathway reconstruction, and phylogenetic classification. The availability of the proposed open source and cloud-enabled VM package will increase the usability of microbial genome sequencing to a broad user community.

Broader Impacts

CloVR is a two-year-old project, funded by NSF and NIH, that aims to simplify and automate large-scale bioinformatics for sequence analysis. A beta release of the VM is already available to a set of beta testers, and we plan a full stable release in the next few months. The grant described the broader impacts as follows:
1) Release of portable VM tool package. The proposed VM package will be made available as a work in progress with at least two trial and four production releases during the funded period. It will be available for download as an open source software tool through the project webpage.
2) Education, training and outreach. The aim of this proposal is to increase the availability of state-of-the-art microbial sequence analysis tools to small research groups. The VM package will be extensively advertised and documented through publications, conference presentations, the project website and an online blog. In addition, an online seminar ("webinar") will be offered to teach the basics of microbial genome analysis using a test set of sequence data distributed together with the VM package. The VM package and accompanying online classes will be developed in close dialogue with the scientific community, whose members will have the opportunity to subscribe to the online blog and its associated discussion forum.
3) De-centralized microbial sequence analysis. Genome analysis can provide significant benefits to many areas of microbial research. The release of next-generation sequencing technologies promotes a new model of affordable, de-centralized microbial sequence analysis with benefits for the entire scientific community. The proposed portable, open source, microbial sequence analysis package will contribute to the success of this model.

Scale of Use

About a dozen VMs, several times a week, to test and process experimental data.

Results

We have two publications describing the CloVR VM in press. The abstract for the main paper is:
Background: Next-generation sequencing technologies have decentralized sequence acquisition, increasing the demand for new bioinformatics tools that are easy to use, portable across multiple platforms, and scalable for high-throughput applications. Cloud computing platforms provide on-demand access to computing infrastructure over the Internet and can be used in combination with custom-built virtual machines to distribute pre-packaged, pre-configured software.
Results: We describe the Cloud Virtual Resource, CloVR, a new desktop application for push-button automated sequence analysis that can utilize cloud computing resources. CloVR is implemented as a single portable virtual machine (VM) that provides several automated analysis pipelines for microbial genomics, including 16S, whole genome and metagenome sequence analysis. The CloVR VM runs on a personal computer, utilizes local computer resources and requires minimal installation, addressing key challenges in deploying bioinformatics workflows. In addition, CloVR supports the use of remote cloud computing resources to improve performance for large-scale sequence processing. In a case study, we demonstrate the use of CloVR to automatically process next-generation sequencing data on multiple cloud computing platforms.
Conclusion: The CloVR VM and associated architecture lower the barrier to entry for utilizing complex analysis protocols on both local single- and multi-core computers and cloud systems for high-throughput data processing.