PubMed MEDLINE Clustered Search

Project Details

Project Lead
Ramji Chandrasekaran
Project Manager
Ramji Chandrasekaran
Institution
Indiana University, Data Science  
Discipline
Computer Science (401) 
Subdiscipline
Biology (603)

Abstract


PubMed MEDLINE is one of the most popular and widely used corpora of biomedical, medical, and pharmaceutical documents. With over 5.2 billion searches made against the MEDLINE corpus on the NCBI website, it is of great value to the research community. Our goal is to cluster the entire PubMed corpus, which has about 28 million documents, using distributed programming frameworks such as Hadoop, Harp, and H2O. Several unsupervised clustering algorithms are under consideration, chief among them the MiniBatchKmeans algorithm. The ultimate aim of the project is to provide a comprehensive search facility for MEDLINE documents, built on top of the clustering results. This educational/research project is part of the INFO I590: Practice in Data Science graduate course. We are currently using high-performance computing systems at IU, such as Karst and Mason.
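
To make the clustering step concrete, here is a minimal single-machine sketch of the mini-batch k-means idea in pure Python. It is only an illustration, not the project's Hadoop, Harp, or H2O implementation. The function names (`nearest`, `minibatch_kmeans`) and all parameters are hypothetical, and the points are assumed to be small dense numeric vectors rather than real MEDLINE document features.

```python
import random


def nearest(centers, point):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((c - x) ** 2 for c, x in zip(centers[j], point)),
    )


def minibatch_kmeans(points, k, batch_size=32, n_iters=100, init_centers=None, seed=0):
    """Illustrative mini-batch k-means; names and defaults are assumptions.

    Each iteration draws a small random batch, assigns every batch point to
    its nearest center, and then moves each center toward its assigned points
    with a per-center learning rate of 1 / (points seen so far).
    """
    rng = random.Random(seed)
    if init_centers is None:
        # Simple random initialization; a real system may use something smarter.
        init_centers = rng.sample(points, k)
    centers = [list(c) for c in init_centers]
    counts = [0] * k  # number of points each center has absorbed

    for _ in range(n_iters):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Cache assignments using the centers as they stand at the start of the batch.
        assigned = [nearest(centers, p) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1.0 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers
```

Because each update touches only one small batch, memory use stays bounded regardless of corpus size, which is the property that makes this family of algorithms attractive for a corpus of this scale.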

Intellectual Merit

The project gives students an excellent opportunity to gain experience in the following areas:
1. unsupervised learning on large datasets
2. information retrieval and indexing
3. distributed programming paradigms
4. topic modelling

Broader Impacts

The end product of this project is a comprehensive search interface custom-built for MEDLINE and other research journal datasets. Researchers typically have to spend hours rummaging through the results of traditional generic search engines to find the documents of interest. By clustering these documents and building an index on top of the clusters, we will be able to provide more focused and concise search results, reducing the time researchers spend on information retrieval. The project will also provide content-driven topical search over journals.
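
One way the cluster-backed search described above could work is sketched below: a query is routed to the most similar cluster and only the documents in that cluster are ranked. This is a toy illustration under stated assumptions, not the project's design. The helper names (`cosine`, `build_cluster_index`, `search`) are hypothetical, and documents are assumed to be already represented as short numeric vectors with known cluster labels.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_cluster_index(doc_vectors, labels):
    """Group document ids by cluster label and compute one centroid per cluster."""
    members = defaultdict(list)
    for doc_id, label in labels.items():
        members[label].append(doc_id)
    centroids = {}
    for label, ids in members.items():
        dim = len(doc_vectors[ids[0]])
        centroids[label] = [
            sum(doc_vectors[i][d] for i in ids) / len(ids) for d in range(dim)
        ]
    return dict(members), centroids


def search(query_vec, doc_vectors, members, centroids, top_n=3):
    """Pick the cluster whose centroid best matches the query, then rank its documents."""
    best = max(centroids, key=lambda label: cosine(query_vec, centroids[label]))
    ranked = sorted(
        members[best],
        key=lambda doc_id: cosine(query_vec, doc_vectors[doc_id]),
        reverse=True,
    )
    return best, ranked[:top_n]
```

Restricting the ranking step to a single cluster is what narrows the result set; the trade-off is that relevant documents assigned to a neighbouring cluster are missed unless more than one cluster is searched.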

Scale of Use

We will keep one set of VMs (10-15 extra-large VMs) running almost constantly. All project members will share these VMs to submit clustering jobs and examine results.

Results