Course: ILS-Z604 Big Data Analytics for Web and Text - SP14 Group #2

Project Details

Project Lead
Trevor Edelblute 
Project Manager
Trevor Edelblute 
Project Members
Siyuan Guo  
Institution
Indiana University, Department of Information & Library Science, School of Informatics & Computing  
Discipline
Computer Science (401) 

Abstract

FutureGrid resources (Eucalyptus Hadoop/MapReduce) will facilitate the completion of a data science project for ILS-Z604 at Indiana University. We propose to apply topic modeling algorithms to the collection of documents within a dataset from the HathiTrust Research Center. While we will likely begin by performing word frequency counts on the data, our ultimate goal is to identify patterns and semantic meaning in the corpus. Depending on the scope of the assignments and available resources, we may also create visualizations to aid in the communication of topical relationships.
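As a rough illustration of the word frequency step mentioned above, the following is a minimal sketch of a MapReduce-style word count in plain Python. It is not the project's actual pipeline: the function names (map_document, reduce_counts, word_frequencies), the simple letters-only tokenizer, and the sample sentences are all assumptions made for illustration, and a real Hadoop job would distribute the map and reduce steps across nodes rather than run them in a single process.

```python
import re
from collections import Counter
from functools import reduce

# Hypothetical tokenizer: runs of ASCII letters only (an assumption;
# real OCR text would likely need more careful handling).
TOKEN_RE = re.compile(r"[a-z]+")


def map_document(text):
    """Map step: return per-document word counts as a Counter."""
    return Counter(TOKEN_RE.findall(text.lower()))


def reduce_counts(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right


def word_frequencies(documents):
    """Apply the map step to each document, then fold the partial counts."""
    return reduce(reduce_counts, map(map_document, documents), Counter())


# Illustrative stand-in for corpus text (not HathiTrust data).
sample = ["The whale, the sea.", "The sea and the ship."]
totals = word_frequencies(sample)
print(totals.most_common(3))
```

Because the reduce step is just an associative merge of counts, partial results from different documents can be combined in any order, which is the property that lets this kind of job scale out on MapReduce.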

Intellectual Merit

FutureGrid resources will further the course learning outcomes:
--The most important and current grounding philosophies, theories, and models for data science
--How to view real-world problems through the lens of these theories and models and solve these problems using the data science perspective (case studies)
--Basic data processing and statistical analysis methods
--Basic machine learning, data retrieval, ranking, and recommendation algorithms
--Basics of R, Lucene, Hadoop and NoSQL (MongoDB), in lab sessions

Broader Impacts

By completing the course-based research project, the team may identify opportunities for identifying and correcting errors resulting from the optical character recognition process, and apply a working solution to a large corpus (250,000 volumes, ~500GB). Results and methods will be documented in a final paper or in-class presentation.

Scale of Use

We will use the system intermittently for the remainder of the spring 2014 semester.

Results