Course: Applied Cyberinfrastructure Concepts

Project ID
FG-363
Project Categories
Computer Science
Completed
Abstract
The resources provided by FutureGrid will be utilized by students enrolled in Applied Cyberinfrastructure Concepts (ISTA 420/520, Fall 2013) at the University of Arizona. This project-based learning class will introduce fundamental concepts, tools, and resources for effectively managing common tasks associated with analyzing large datasets. It will provide familiarity with cyberinfrastructure (CI) resources available at the University of Arizona campus, the iPlant Collaborative, NSF XSEDE centers, FutureGrid, and commercial providers such as Amazon. Students will learn to apply relevant CI skills to a final project and will develop wiki-based documentation of best practices, learning how to collaborate effectively in interdisciplinary team settings and to target the optimal CI resources and tools for their project. The course will comprise a series of guest lectures by subject-matter experts from projects that have developed widely adopted foundational cyberinfrastructure resources, followed by hands-on laboratory exercises focused on those resources (some of which will be tailored for cloud resources on FutureGrid). Students will apply the practical experience gained from these laboratory exercises to a final project, which will include datasets and requirements provided by domain scientists (Genomics and Geosciences). Students will be given access to compute resources at UA campus clusters, the iPlant Collaborative, NSF XSEDE, and FutureGrid. Students will also learn how to write a proposal for obtaining future allocations on large-scale national resources through XSEDE.
Use of FutureSystems
Students will learn how to create custom VMs for their specific projects, which require a full LAMP stack to support an integrated bioinformatics application.
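As an illustration, a LAMP-stack VM of this kind could be provisioned at boot with a cloud-init user-data file along these lines. This is a minimal sketch, assuming an Ubuntu base image; the package names and services would need to be adjusted for the actual image used:

```
#cloud-config
# Hypothetical cloud-init user-data for a LAMP-stack VM (assumes Ubuntu packages).
packages:
  - apache2
  - mysql-server
  - php
  - libapache2-mod-php
  - php-mysql
runcmd:
  # Start the web and database servers and enable them across reboots.
  - systemctl enable --now apache2
  - systemctl enable --now mysql
```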
These projects will include a) use of iRODS for federated data handling, and b) Makeflow (CCTools, from Douglas Thain at Notre Dame) and Pegasus (Ewa Deelman at USC) to provide scale-out of tasks. Students may also choose to implement a Condor cluster or StarCluster (http://web.mit.edu/star/hpc/index.html) for managing their tasks.
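To give a flavor of the task scale-out, a Makeflow workload is described in a Make-like rule file: each rule names its output files, input files, and the command that produces them, and Makeflow dispatches independent rules to local cores, Condor, or remote workers. The file below is a hypothetical sketch (the script names and data files are illustrative, not from an actual assignment):

```
# Hypothetical Makeflow rule file: outputs : inputs, then an indented command.
# Independent rules (part1, part2) can run concurrently on separate workers.
part1.out: part1.in analyze.py
    python analyze.py part1.in > part1.out
part2.out: part2.in analyze.py
    python analyze.py part2.in > part2.out
summary.out: part1.out part2.out merge.py
    python merge.py part1.out part2.out > summary.out
```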
The above-mentioned applications will store certain data components using a NoSQL database (MongoDB) and a key-value store (Redis), and will use ZeroMQ (http://www.zeromq.org/) as a concurrency framework.
Hadoop will be utilized, alongside Apache Pig, to scale out some of the queries from MongoDB.
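The kind of query being scaled out here is a grouped aggregation: map each record to a key-value pair, then reduce by summing per key, which Pig expresses as GROUP ... BY / SUM(...). The following is a minimal, dependency-free Python sketch of that map-and-reduce pattern; the field names and sample records are hypothetical stand-ins for documents exported from MongoDB:

```python
from collections import defaultdict

# Hypothetical sample records, standing in for documents exported from MongoDB.
records = [
    {"gene": "BRCA1", "reads": 120},
    {"gene": "TP53", "reads": 85},
    {"gene": "BRCA1", "reads": 40},
]

def map_phase(recs):
    # Map: emit (key, value) pairs, as a Hadoop mapper would.
    for r in recs:
        yield r["gene"], r["reads"]

def reduce_phase(pairs):
    # Reduce: sum values per key, the equivalent of Pig's GROUP ... / SUM(...).
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

totals = reduce_phase(map_phase(records))
print(totals)  # {'BRCA1': 160, 'TP53': 85}
```

Hadoop applies exactly this pattern, but with the map and reduce phases distributed across many nodes and the shuffle between them handled by the framework.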
The above-mentioned resources will integrate with similar resources running at iPlant and on XSEDE systems (TACC), e.g., iRODS federation with nodes at iPlant and Makeflow workers executing at TACC.
Students will be asked to choose different platforms/cloud infrastructures (OpenStack, Amazon) on which to implement their work, and to compare and contrast performance and capabilities.
Scale of Use
There are 24 students in the class. Certain assignments will be based on FutureGrid resources and will be conducted in groups of 4. I expect the whole class to use 4-10 VMs for design/prototyping (over a 2-week period), followed by a scale-out to 40 VMs for the final assignment (for the whole class). Hardware requirements will be 2- to 4-core machines with 4-8 GB RAM and 10-50 GB disk space (EBS-style or some other form of persistent storage).
All scale-out will be done under the guidance of, and in collaboration with, the FutureGrid team once the proof-of-concept VMs are functional.
All requirements are flexible and can be tailored to the resources available.