Word Sense Disambiguation for Web 2.0 Data

Project Details

Project Lead
Jonathan Klinginsmith 
Project Manager
Jonathan Klinginsmith 
Supporting Experts
Xiaoming Gao  
Institution
Indiana University, School of Informatics  
Discipline
Computer Science (401) 

Abstract

In this work, we plan to create an architecture for developing and testing a variety of parallel similarity and parallel clustering algorithms that run against Web 2.0 data. These algorithms will be used to analyze emerging semantics and word senses within the data.
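
As a rough illustration of the kind of similarity computation such an architecture would host, the sketch below compares the contexts in which an ambiguous word appears. It is a minimal example under stated assumptions, not the project's algorithm: the bag-of-words representation, the cosine measure, the function names, and the sample sentences are all hypothetical choices made for illustration. Each pair of contexts is scored independently, which is what makes this style of computation a natural candidate for parallel execution.

```python
from collections import Counter
from math import sqrt


def context_vector(sentence, target):
    """Bag-of-words counts for a sentence, excluding the target word itself."""
    tokens = sentence.lower().split()
    return Counter(t for t in tokens if t != target)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarity(sentences, target):
    """Full similarity matrix; every cell can be computed independently."""
    vectors = [context_vector(s, target) for s in sentences]
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


# Hypothetical contexts for the ambiguous word "bank" (illustrative only).
CONTEXTS = [
    "the bank approved the loan",
    "the bank raised the loan rate",
    "we sat on the river bank",
]
SIMILARITY = pairwise_similarity(CONTEXTS, "bank")
```

In this toy input, the two financial contexts score higher against each other than either does against the river context, which is the signal a clustering step would then use to group occurrences by sense.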

Intellectual Merit

User-generated data on the Web is but one example of where researchers are encountering the challenges of "big data." This phenomenon can be described as the problem of large datasets being generated and updated at scales where they become difficult to store, manage, and visualize, among other challenges. This project will allow students and researchers to investigate the challenges of big data from a computer science and engineering perspective. The goal of this project is to investigate a specific natural language processing problem (word sense disambiguation), producing results for that problem while also informing the broader context of the big data paradigm. The project is supported by two faculty members and a Ph.D. student in computer science. Insight gained from this project will benefit the following research communities: natural language processing, information modeling, and cloud and grid computing.

Broader Impacts

The broader impact of this project is to provide a Ph.D. student with a dissertation topic that can then be expanded into future teaching material for students at Indiana University. The project ties well into the mission of Indiana's School of Informatics and Computing: teaching and researching computing and information technology topics while integrating these topics into scientific and human issues. The results of this project will allow other institutions to use the methodologies and framework to perform the same experiments.

Scale of Use

We will need around ten VMs to run experiments. We will use these VMs many times over the course of a couple of months to test a variety of algorithms.

Results

Through this project, we realized there was a gap in how researchers create reproducible eScience experiments in the cloud, so the research shifted to tackle this problem. Toward this goal, we had a paper titled "Towards Reproducible eScience in the Cloud" accepted to the 3rd IEEE International Conference on Cloud Computing Technology and Science (http://www.ds.unipi.gr/cloudcom2011/program/accepted-papers.html).

In this work, we demonstrated the following:

  • The separation of scalable computing environments into two distinct layers: (1) the infrastructure layer and (2) the software layer.
  • A demonstration, through this separation of concerns, that the installation and configuration operations performed within the software layer can be re-used in separate clouds.
  • The creation of two distinct types of computational clusters using the framework.
  • Two fully reproducible eScience experiments built on top of the framework.
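
The two-layer separation described above can be sketched as follows. This is an illustrative model only, not the framework from the paper: the class names, field names, cloud names, image names, and step descriptions are all hypothetical. The point it tries to show is that cloud-specific details live in one object, while the installation and configuration steps live in another that can be paired unchanged with any infrastructure.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InfrastructureLayer:
    """Cloud-specific settings (all values used below are hypothetical)."""
    cloud: str
    image: str
    node_count: int


@dataclass(frozen=True)
class SoftwareLayer:
    """Cloud-independent installation and configuration steps."""
    name: str
    steps: List[str]


def plan_cluster(infra, software):
    """Return an ordered plan: provision nodes first, then configure each one."""
    plan = [f"provision {infra.node_count} x {infra.image} on {infra.cloud}"]
    for node in range(infra.node_count):
        for step in software.steps:
            plan.append(f"node{node}: {step}")
    return plan


# One software layer, re-used against two hypothetical clouds.
SOFTWARE = SoftwareLayer(
    "example-cluster",
    ["install packages", "write config files", "start services"],
)
CLOUD_A = InfrastructureLayer("cloud-a", "image-a", 2)
CLOUD_B = InfrastructureLayer("cloud-b", "image-b", 2)

PLAN_A = plan_cluster(CLOUD_A, SOFTWARE)
PLAN_B = plan_cluster(CLOUD_B, SOFTWARE)
```

Only the first entry of each plan differs between the two clouds; the software-layer steps that follow are identical, which mirrors the re-use claim in the second bullet.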