Distributed Execution of Kepler Scientific Workflows on FutureGrid

Project Details

Project Lead
Ilkay Altintas 
Project Manager
Ilkay Altintas 
Project Members
Jianwu Wang  
Supporting Experts
Hyungro Lee  
Computer Science (401) 


The open-source Kepler Scientific Workflow System (http://kepler-project.org) is collaboratively designed to help scientists, analysts, and computer programmers (a.k.a. workflow developers) create, execute, and share models and analyses across a broad range of scientific and engineering disciplines. Kepler can operate on data stored in a variety of formats, locally and over the internet, and is an effective environment for integrating disparate software components, such as merging "R" scripts with compiled "C" code, or facilitating remote, distributed execution of models. Using Kepler's graphical user interface, users simply select and then connect pertinent analytical components and data sources to create a "scientific workflow": an executable representation of the steps required to generate results. The Kepler software helps users share and reuse data, workflows, and components developed by the scientific community to address common needs.

We would like to perform detailed testbed experiments for distributed workflow applications using Kepler on the resources provided by FutureGrid. This work will showcase the use of the current Kepler distribution mechanisms, categorize the application scope of each mechanism in terms of its data- and compute-intensive characteristics, and compare their performance. These distribution approaches in Kepler include:

* Master-Slave Distributed Execution: a client/server architecture that uses the RMI framework to distribute parts of a Kepler workflow (a.k.a. sub-workflows) and manage their parallel execution on a set of slave nodes.
* Parameter Sweep using Nimrod/K: Nimrod-based execution of parameter-sweep applications via a Kepler workflow.
* Map/Reduce for Data-Intensive Applications: partitioning of data and data-intensive execution of sub-workflows as Map/Reduce executables on a Hadoop cluster.
* Grid-enabled High-Throughput Computing: using a set of Globus actors in a Kepler workflow to execute applications on Grid environments.
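The Master-Slave mechanism above rests on Java RMI: the master exports sub-workflows to remote slave nodes and collects their results. The following is a minimal, self-contained sketch of that pattern using only the standard `java.rmi` API; the `SubWorkflowExecutor` interface, the `Slave` class, and the string-based "sub-workflow spec" are hypothetical simplifications, not Kepler's actual internal API.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Hypothetical remote interface: a slave node that executes one sub-workflow.
interface SubWorkflowExecutor extends Remote {
    String execute(String subWorkflowSpec) throws RemoteException;
}

// Slave-side implementation (simplified: a real slave would run the
// sub-workflow in an embedded Kepler engine and return its outputs).
class Slave implements SubWorkflowExecutor {
    public String execute(String spec) throws RemoteException {
        return "executed:" + spec;
    }
}

public class MasterSlaveDemo {
    public static void main(String[] args) throws Exception {
        // Master side: start an RMI registry and register one slave.
        Registry registry = LocateRegistry.createRegistry(5099);
        SubWorkflowExecutor stub =
            (SubWorkflowExecutor) UnicastRemoteObject.exportObject(new Slave(), 0);
        registry.rebind("slave-0", stub);

        // Master looks up the slave and dispatches a sub-workflow over RMI.
        SubWorkflowExecutor slave =
            (SubWorkflowExecutor) registry.lookup("slave-0");
        System.out.println(slave.execute("sub-workflow-A"));
    }
}
```

In the real system the registry and slaves run on separate FutureGrid nodes, and the master dispatches many sub-workflows concurrently; the single-process setup here only illustrates the call path.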
Under the scope of this project, we wish to:

* deploy Java 1.5 or higher, Kepler, RMI, Globus, a local job scheduler, Nimrod, and Hadoop on virtual and non-virtual FutureGrid resources;
* install our testing workflows, executables, and datasets;
* and execute a number of Kepler workflows utilizing the deployed infrastructure.

Overall, the proposed work will demonstrate the deployment and usage of distributed Kepler workflows across the FutureGrid resources.
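The Map/Reduce mechanism partitions input data and runs a sub-workflow independently on each partition before aggregating the results. The sketch below illustrates that two-phase pattern in plain Java, without a Hadoop dependency; `executeSubWorkflow` is a hypothetical stand-in for running a Kepler sub-workflow as a Map task, and the partition layout is invented for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceSketch {
    // Map task: run the (hypothetical) sub-workflow on one data partition,
    // emitting a (partition id, record count) pair.
    static Map.Entry<String, Integer> executeSubWorkflow(String partitionId,
                                                         List<String> records) {
        return Map.entry(partitionId, records.size());
    }

    public static void main(String[] args) {
        // Input data split into partitions, as Hadoop would split an input file.
        Map<String, List<String>> partitions = Map.of(
            "p0", List.of("a", "b"),
            "p1", List.of("c", "d", "e"));

        // Map phase: partitions are processed independently, so they can
        // run in parallel across cluster nodes (here, across local threads).
        Map<String, Integer> mapped = partitions.entrySet().parallelStream()
            .map(e -> executeSubWorkflow(e.getKey(), e.getValue()))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));

        // Reduce phase: aggregate the per-partition results into one value.
        int total = mapped.values().stream().mapToInt(Integer::intValue).sum();
        System.out.println(total); // prints 5
    }
}
```

On a real Hadoop cluster the map and reduce phases would be expressed as `Mapper` and `Reducer` classes and scheduled by the framework; the data-parallel structure is the same.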

Intellectual Merit

We will make fundamental advances in information technology at the intersection of scientific workflows and data-intensive distributed computing. We will provide a generic, scientific-workflow-based approach to accelerate this process and demonstrate its usage on FutureGrid.

Broader Impacts

The impact is twofold. First, it helps users speed up their domain-specific workflow applications by using Kepler's distributed execution architectures and the FutureGrid infrastructure. Second, it demonstrates Kepler's capability to simplify the construction of, and provide support for, distributed workflow applications.

Scale of Use

On-demand provisioning of virtual machines for verifying the feasibility and testing the performance of the distributed execution architectures in the Kepler workflow system. The number of VM instances is expected to vary from very small (fewer than 5) to medium capacity (about 20). For testing purposes, the average VM uptime is expected to be 24 hours per instantiation.