Deployment of Virtual Clusters on a Commercial Cloud Platform for Molecular Docking

Project Details

Project Lead
Anthony Nguyen 
Project Manager
Kohei Ichikawa 
Project Members
Anthony Nguyen, Yue Song, Katy Pham, Kohei Ichikawa, Jason Haga  
Institution
University of California, San Diego, PRIME/PRAGMA  
Discipline
Computer Science (401) 
Subdiscipline
26.0616 Biotechnology Research 

Abstract

The project aims to create virtual clusters that run DOCK, a protein-ligand molecular interaction simulation program, and to deploy them on FutureGrid. This will allow docking tasks to be performed on a large scale cheaply and efficiently. Three areas will be investigated: 1) the elasticity of the virtual clusters, 2) the fault tolerance of the system, and 3) the use of several virtual clusters on various commercial clouds to form a single system. By utilizing commercial and other clouds, such as FutureGrid, we expect system performance to increase, allowing millions of protein-ligand interaction simulations to be run in a massively parallel manner.

Intellectual Merit

The proposed project emphasizes three specialized topics that concern flexibly scaling and configuring virtual machines across a multi-cloud environment and detecting and isolating failures of the system. The merit of this project is that it is a real-world test of a multi-cloud implementation of the virtual screening software DOCK. The results from testing this system will be compared to our previous results obtained with single clusters or grid computing. We expect this work to provide a more consistent, stable, and easy-to-use virtual screening workflow on a very large scale. Additionally, given the large amount of resources available on a cloud, the number of simulations that can be run on a virtual cluster scales easily with Hadoop/MapReduce, so the system can adapt to the ever-increasing size of chemical compound databases. As a result, the system will facilitate a better understanding and mapping of protein-ligand interactions, furthering knowledge of protein signaling networks in the human body.

Broader Impacts

The deployment of the DOCK simulation program onto clouds will allow for two distinct improvements that will enhance the process of drug discovery: 1) a significant increase in the availability of resources, regardless of the user's computer science background, and 2) a decrease in the duration of the overall drug discovery process. First, making the DOCK simulation program available on clouds gives other scientists access to an essential tool that speeds up drug discovery through its ability to screen millions of compounds in a short amount of time. The intended user will not need extensive knowledge of computer science and will not bear the large financial burden of procuring and maintaining the hardware needed for grid computing. Second, making the DOCK simulation program available on clouds for use by many different communities will facilitate interactions between scientists and have a positive impact on the drug discovery process in the pharmaceutical industry.

Scale of Use

The scale of use depends on the task being performed. There are three tasks to be accomplished. The first is multi-cloud work in which virtual clusters on FutureGrid communicate with virtual clusters on other clouds. For this task, only about 5-10 VMs will be used. The other two tasks examine the fault tolerance and the elasticity of the virtual machine clusters. For these two tasks, hundreds to thousands of VMs will be used in conjunction with Hadoop/MapReduce, as we hope to observe the limitations of this method of computing.
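
As a rough illustration of how a virtual screening job could be partitioned in a MapReduce style, the following minimal Python sketch splits a ligand list across workers, scores each chunk, and merges the per-chunk results into a list of top hits. It is not the project's implementation and does not use Hadoop or DOCK: mock_dock_score is a hypothetical stand-in for a real docking run, and the ligand identifiers, worker counts, and the convention that a lower score is better are all assumptions made for the example.

```python
from functools import reduce
import heapq


def mock_dock_score(ligand_id):
    """Hypothetical stand-in for one docking run (NOT the real DOCK program).

    Returns a deterministic fake score; lower is treated as better.
    """
    return -float(sum(ord(c) for c in ligand_id) % 100)


def split_into_chunks(items, n_workers):
    """Partition the ligand list round-robin, one chunk per worker."""
    return [items[i::n_workers] for i in range(n_workers)]


def map_chunk(chunk):
    """Map step: score every ligand in one chunk."""
    return [(mock_dock_score(lig), lig) for lig in chunk]


def reduce_top_hits(left, right, k):
    """Reduce step: merge two partial results and keep the k best scores."""
    return heapq.nsmallest(k, left + right)


def screen(ligands, n_workers, k=3):
    """Run the map step on each chunk, then fold the results together."""
    chunks = split_into_chunks(ligands, n_workers)
    mapped = [map_chunk(c) for c in chunks]  # would run in parallel on VMs
    return reduce(lambda a, b: reduce_top_hits(a, b, k), mapped, [])


if __name__ == "__main__":
    ligands = ["LIG%04d" % i for i in range(20)]  # made-up identifiers
    for score, ligand in screen(ligands, n_workers=5):
        print(ligand, score)
```

Because the merge step keeps only the k best entries and is insensitive to how the ligands were partitioned, the final list is the same whether the work is spread over 5 workers or 5,000, which is the property that lets such a job grow or shrink with the number of available VMs.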

Results

By the conclusion of this project, we expect to have a functional set of virtual clusters running protein-ligand interaction simulations. It will be possible to network these virtual clusters with similar virtual clusters on different clouds so that tasks can be split among more resources. We also expect to understand the limits of the system through our fault tolerance and elasticity testing.