PubMed MEDLINE Clustered Search
PubMed MEDLINE is one of the most widely used corpora of biomedical, medical, and pharmaceutical documents. With over 5.2 billion searches made against the MEDLINE corpus on the NCBI website, it is of great value to the research community. Our goal is to cluster the entire PubMed corpus, which contains roughly 28 million documents, using distributed programming frameworks such as Hadoop, Harp, and H2O. We are considering several unsupervised clustering algorithms, chief among them the MiniBatchKMeans algorithm. The ultimate aim of the project is to provide a comprehensive search facility for MEDLINE documents, built on top of the clustering results. This educational/research project is part of the INFO I590: Practice in Data Science graduate course. We are currently using high-performance computing systems at IU such as Karst and Mason.
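For readers unfamiliar with mini-batch k-means, the idea can be sketched in a few lines of plain Python. This is an illustrative single-machine toy, not the project's Hadoop/Harp implementation: the function names (nearest, mini_batch_kmeans), the parameters, and the use of small numeric tuples in place of document feature vectors are all assumptions made for the example. Each iteration draws a small random batch, assigns its points to the nearest center, and nudges each center toward its points with a per-center step size that shrinks as the center accumulates assignments.

```python
import random


def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, center in enumerate(centers):
        dist = sum((p - c) ** 2 for p, c in zip(point, center))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def mini_batch_kmeans(points, k, batch_size=32, n_iters=200, seed=0):
    """Toy mini-batch k-means over a list of equal-length numeric tuples.

    All names and defaults here are hypothetical choices for illustration.
    """
    rng = random.Random(seed)
    # Initialise centers from k distinct randomly chosen data points.
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # number of points each center has absorbed so far

    for _ in range(n_iters):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch against the centers as they stand now.
        assignments = [nearest(p, centers) for p in batch]
        # Move each center toward its assigned points; the step size
        # 1/count makes the center a running mean of what it has seen.
        for p, idx in zip(batch, assignments):
            counts[idx] += 1
            lr = 1.0 / counts[idx]
            centers[idx] = [(1 - lr) * c + lr * x for c, x in zip(centers[idx], p)]
    return centers


if __name__ == "__main__":
    # Two synthetic 2-D blobs standing in for document vectors.
    data_rng = random.Random(1)
    data = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(100)]
    data += [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(100)]
    print(mini_batch_kmeans(data, k=2))
```

Because each update touches only a small batch, memory use and per-iteration cost stay bounded regardless of corpus size, which is the property that makes this family of algorithms attractive for a corpus of this scale.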
Use of FutureSystems
We will primarily use FutureSystems to set up a Hadoop/Harp cluster and run clustering jobs on it. In later stages of the project, we will also use GPUs to accelerate our computation.
Scale of Use
We will keep one set of VMs (10-15 extra-large VMs) running almost constantly. All project members will share these VMs to submit clustering jobs and examine results.