Learning a vector representation of protein sequences

Project Details

Project Lead
Yuxiang Jiang 
Project Manager
Yuxiang Jiang 
Project Members
Achal Shah  
Indiana University Bloomington, Computer Science  
Computer Science (401) 
11.07 Computer Science 


Computational methods for protein function prediction play an important role in the era of inexpensive sequencing, partly because of the gap between the abundance of unannotated biological sequences and the resource-intensive experiments needed to characterize them. Developing machine learning tools that operate on protein sequences (a protein's primary structure) first requires a good representation of those sequences. In this project we will explore this direction mainly with deep learning architectures, and evaluate the quality of the learned features by how well they support prediction of a protein's function.
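
To make the notion of a sequence representation concrete, the sketch below shows a simple hand-crafted baseline that a learned representation might be compared against. It is an illustration only, not the project's method: the function names `one_hot_encode` and `composition`, the 20-letter alphabet ordering, and the handling of non-standard residues are all assumptions made for this example.

```python
# Hypothetical baseline for illustration; not the representation this project learns.
# Assumption: the alphabet is the 20 standard amino acids in alphabetical
# one-letter-code order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-dimensional one-hot row per residue.

    Residues outside the 20 standard amino acids (for example 'X')
    are mapped to an all-zero row.
    """
    rows = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows


def composition(sequence):
    """Average the one-hot rows into a fixed-length 20-dimensional vector.

    This gives the fraction of each amino acid in the sequence, so
    sequences of different lengths map to vectors of the same size.
    """
    rows = one_hot_encode(sequence)
    if not rows:
        return [0.0] * len(AMINO_ACIDS)
    return [sum(col) / len(rows) for col in zip(*rows)]


if __name__ == "__main__":
    vec = composition("MKVLAAGIVG")  # made-up example sequence
    print(len(vec), vec)
```

A composition vector like this discards residue order entirely, which is one reason a learned representation could be expected to carry more functional signal; whether it does is what the prediction-performance evaluation would measure.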

Intellectual Merit

A learned protein sequence representation would make it possible to apply a wide variety of standard machine learning methods to proteins and could produce better functional annotations. Such computational methods would help biocurators and biologists gain insight into what an uncharacterized sequence does at the molecular level.

Broader Impacts

A better protein function predictor would yield biological insights and contribute to related fields such as drug development.

Scale of Use

A few GPUs for several hours per week. The project is expected to last about 3 months.