Course: IPython pipelines for training life sciences researchers on NGS data analysis
Genomes and transcriptomes encapsulate the story of living things in the form of nucleic acid sequences. Petabytes of sequence data are now publicly available, the product of recent advances in DNA sequencing technology (Next-Generation Sequencing, NGS). To exploit these resources effectively, research scientists, clinicians and students in the life sciences need to become familiar with (1) the Linux command line, (2) at least one scripting language (e.g., Perl or Python) and (3) high-performance computing. This project uses NGS data, high-performance computing and IPython running on Linux virtual machines (VMs) to elucidate molecular mechanisms of epigenetic gene regulation, RNA polymerase function and RNA-directed DNA methylation (RdDM) in the model organism Arabidopsis. Over the course of the project (~1 year), postdoctoral researchers and graduate students will be trained to use standard bioinformatic tools for NGS data analysis, including bowtie, samtools, R packages and scientific Python. We are developing Python scripts that string these tools together into pipelines for genetic mapping of single nucleotide polymorphisms (SNPs) and analysis of RNA sequencing (RNA-seq) datasets. We will use Linux VMs on FutureGrid to tackle important research questions in the field of RdDM, while developing training materials (Git repositories and IPython notebooks) for teaching advanced undergraduate students, graduate students and postdocs to use command line tools for short read alignment and analysis.
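As an illustration of what "stringing these tools together" can look like, the following is a minimal sketch and not the project's actual pipeline. It assumes that bowtie and samtools are installed and on the PATH; the function names, the index prefix `tair10_index` and the file names are hypothetical placeholders. The script only builds and prints the commands unless `dry_run` is set to False.

```python
import shlex
import subprocess
from pathlib import Path


def build_alignment_commands(index_prefix, fastq, sample, out_dir=".", threads=4):
    """Return argv lists for a bowtie -> samtools short read alignment run.

    All file names are derived from `sample`; nothing is executed here.
    """
    out = Path(out_dir)
    sam = str(out / f"{sample}.sam")
    bam = str(out / f"{sample}.bam")
    sorted_bam = str(out / f"{sample}.sorted.bam")
    return [
        # Align reads to the index; -S requests SAM output, -p sets threads.
        ["bowtie", "-S", "-p", str(threads), index_prefix, fastq, sam],
        # Convert the SAM file to compressed BAM.
        ["samtools", "view", "-b", "-o", bam, sam],
        # Sort alignments by reference coordinate.
        ["samtools", "sort", "-o", sorted_bam, bam],
        # Build an index so downstream tools can access regions quickly.
        ["samtools", "index", sorted_bam],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command and, unless dry_run is True, execute it in order."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical inputs: a prebuilt bowtie index and one FASTQ file.
    cmds = build_alignment_commands("tair10_index", "sample1.fastq", "sample1")
    run_pipeline(cmds, dry_run=True)
```

Keeping command construction separate from execution lets trainees inspect each command line in an IPython notebook before committing cluster time to it.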
Use of FutureSystems
The project will provision several xlarge VMs in the OpenStack environment for intervals of 1-2 weeks. The VMs will carry custom-installed bioinformatic tools and give up to six student and postdoc users remote access to NGS analysis pipelines.
Scale of Use
Up to four VMs for bioinformatic analyses and student/postdoc training, as well as cycles on the india, sierra or xray HPC clusters as available.