Distributed Execution of Kepler Scientific Workflow on FutureGrid

Project ID
Project Categories
Technology Evaluation
The open-source Kepler Scientific Workflow System (http://kepler-project.org) is collaboratively designed to help scientists, analysts, and computer programmers (a.k.a. workflow developers) create, execute, and share models and analyses across a broad range of scientific and engineering disciplines. Kepler can operate on data stored in a variety of formats, locally and over the internet, and is an effective environment for integrating disparate software components, such as merging "R" scripts with compiled "C" code, or facilitating remote, distributed execution of models. Using Kepler's graphical user interface, users select and connect pertinent analytical components and data sources to create a "scientific workflow", an executable representation of the steps required to generate results. The Kepler software helps users share and reuse data, workflows, and components developed by the scientific community to address common needs.
We would like to build a detailed testbed for distributed workflow applications using Kepler on the resources provided by FutureGrid. This work will showcase the current Kepler distribution mechanisms, categorize the application scope of each mechanism in terms of its data- and compute-intensive characteristics, and compare the performance of the mechanisms against one another. These distribution approaches in Kepler include:
* Master-Slave Distributed Execution: a client/server architecture that uses the RMI framework to distribute parts of a Kepler workflow (a.k.a. sub-workflows) and manage their parallel execution on a set of slave nodes.
* Parameter Sweep using Nimrod/K: Nimrod-based execution of parameter-sweep applications via a Kepler workflow.
* Map/Reduce for Data-Intensive Applications: partitioning of data and data-intensive execution of sub-workflows as Map/Reduce executables on a Hadoop cluster.
* Grid-enabled High-Throughput Computing: using a set of Globus actors in a Kepler workflow to execute applications on Grid environments.
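To make the Map/Reduce approach listed above more concrete, here is a minimal sketch of the partition, map, and reduce steps in plain Python. It is not Kepler or Hadoop code: the names `partition`, `run_sub_workflow`, `merge`, and `map_reduce` are hypothetical, local threads stand in for cluster nodes, and word counting stands in for whatever a real sub-workflow would compute.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(records, n_parts):
    """Split records into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(records), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def run_sub_workflow(chunk):
    """Map step (hypothetical stand-in for a sub-workflow): count words in one chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def merge(left, right):
    """Reduce step: combine two partial word counts."""
    return left + right


def map_reduce(records, n_workers=4):
    """Partition the data, run the map step on each chunk in parallel, then reduce."""
    chunks = partition(records, n_workers)
    # Assumption: threads stand in for worker nodes; a real run would use a Hadoop cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(run_sub_workflow, chunks))
    return reduce(merge, partials, Counter())
```

Calling `map_reduce(["a b a", "b c", "a"], n_workers=2)` splits the three records into two chunks, counts words in each chunk independently, and merges the partial counts into one result. The same shape (independent work on partitions followed by a merge) is what makes this style of execution suitable for data-intensive workloads.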
Under the scope of this project, we wish to:
* deploy Java 1.5 or higher, Kepler, RMI, Globus, a local job scheduler, Nimrod, and Hadoop on virtual and non-virtual FutureGrid resources;
* install our testing workflows, executables, and datasets; and
* execute a number of Kepler workflows utilizing the deployed infrastructure.
Overall, the proposed work will demonstrate the deployment and usage of distributed Kepler workflows across the FutureGrid resources.
Use of FutureSystems
Used as infrastructure for research on distributed execution in the Kepler scientific workflow system.
Scale of Use
On-demand provisioning of virtual machines for verifying the feasibility and testing the performance of the distributed execution architectures in the Kepler workflow system. The number of VM instances is expected to vary from very small (fewer than 5) to medium capacity (about 20). For testing purposes, the average VM uptime is 24 hours per instantiation.
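As a rough illustration of what these figures imply, the sketch below converts an instance count into VM-hours per instantiation. The function name `vm_hours` and the sample counts of 4 and 20 are assumptions chosen to match the "fewer than 5" and "about 20" bounds; actual usage will vary.

```python
HOURS_PER_INSTANTIATION = 24  # average uptime stated in the Scale of Use text


def vm_hours(n_instances, hours=HOURS_PER_INSTANTIATION):
    """Total VM-hours consumed by one instantiation of n_instances VMs."""
    return n_instances * hours


# Sample counts are assumptions: 4 for "fewer than 5", 20 for "about 20".
for n in (4, 20):
    print(n, "VMs ->", vm_hours(n), "VM-hours")
```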