Learning a vector representation of protein sequences

Project ID
FG-534
Project Categories
Computer Science
Completed
Abstract
Computational methods for protein function prediction play an important role in the era of inexpensive sequencing, partly because of the gap between the abundance of unannotated biological sequences and the resource-intensive experiments needed to annotate them. Developing machine learning tools that operate on protein sequences (that is, on primary structure) first requires a good representation. In this project we will explore this direction mainly with deep learning architectures, and we will assess the quality of the learned features by how well they predict a protein's function.
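The abstract does not say how sequences are presented to a model. As a minimal sketch only, assuming the 20 standard one-letter amino acid codes, the fixed encodings below (one-hot rows and overlapping k-mers) are the kind of simple input a learned representation could start from or be compared against. The names `one_hot_encode` and `kmers` are hypothetical and are not taken from the project.

```python
# Illustrative baseline encodings; not the project's actual method.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes (assumed alphabet)
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-dimensional one-hot row per residue.

    Residues outside the assumed alphabet (e.g. 'X') become all-zero rows.
    """
    rows = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows


def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers, a common tokenization
    for training embedding models on biological sequences."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
```

Either output can be fed to a deep learning model (for example, k-mers as tokens for an embedding layer); which choice suits function prediction is exactly the question the project investigates.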
Use of FutureSystems
I will deploy Python code on FutureSystems, most of which requires GPUs (e.g., TensorFlow-based code), and collect the results.
Scale of Use
A few GPUs for several hours per week. The project is expected to last about 3 months.