Data mining samples based on Twister

Project Details

Project Lead
Zhanquan Sun 
Project Manager
Zhanquan Sun 
Supporting Experts
Bingjing Zhang, Yang Ruan  
Institution
Indiana University Bloomington, Pervasive Technology Institute  
Discipline
Computer Science (401) 
Subdiscipline
11.01 Computer and Information Sciences, General 

Abstract

Large-scale data are collected in many application fields. Extracting useful knowledge from raw data to support decision-making is an active research topic. Many data-intensive data mining technologies have been developed, and Map/Reduce is considered among the most efficient of them. Twister, software developed by PTI, is based on iterative Map/Reduce. It markedly improves computation speed on data-intensive problems. This project studies the implementation on Twister of several data mining methods, such as SVM, Apriori, and entropy-based correlation analysis. The programs will be developed in Java on the Linux platform. The methods will be used to analyze bioinformatics or biomedical examples.

Intellectual Merit

Commonly used data mining methods have been widely studied and are mature. Most of them are serial, however, and cannot be used to analyze large-scale data. Parallelizing data mining methods is difficult: the structure of each algorithm must be studied and redesigned so that it can run in parallel. By studying how to program these methods in parallel, the project aims to improve computational efficiency while preserving the precision of the data analysis.
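As a rough illustration of what such a redesign can look like, the sketch below restructures the support-counting step of Apriori into a map step (count candidates within one data partition) and a reduce step (sum the partial counts). It is a minimal sketch in plain Java and does not use the Twister API; the class name SupportCountSketch and the methods itemset, mapPartition, reduce, and countSupport are hypothetical names chosen for this example, and a parallel stream stands in for distributed map tasks.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration only: plain Java, not the Twister API.
class SupportCountSketch {

    // Convenience helper to build an itemset from item names.
    static Set<String> itemset(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // "Map" step: within one partition, count how many transactions
    // contain each candidate itemset.
    static Map<Set<String>, Integer> mapPartition(List<Set<String>> partition,
                                                  List<Set<String>> candidates) {
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Set<String> transaction : partition) {
            for (Set<String> candidate : candidates) {
                if (transaction.containsAll(candidate)) {
                    counts.merge(candidate, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // "Reduce" step: sum the partial counts produced by all partitions.
    static Map<Set<String>, Integer> reduce(List<Map<Set<String>, Integer>> partials) {
        Map<Set<String>, Integer> total = new HashMap<>();
        for (Map<Set<String>, Integer> partial : partials) {
            partial.forEach((itemset, count) -> total.merge(itemset, count, Integer::sum));
        }
        return total;
    }

    // Runs the map step on every partition (in parallel on one machine here)
    // and then merges the results.
    static Map<Set<String>, Integer> countSupport(List<List<Set<String>>> partitions,
                                                  List<Set<String>> candidates) {
        List<Map<Set<String>, Integer>> partials = partitions.parallelStream()
                .map(partition -> mapPartition(partition, candidates))
                .collect(Collectors.toList());
        return reduce(partials);
    }
}
```

Because each partition is counted independently and the partial counts are merged by addition, the total does not depend on how the transactions are split, so the parallel version returns the same support counts as a serial scan. In an iterative setting, this map and reduce pair would be repeated once per candidate length.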

Broader Impacts

The research results of the project can be applied in many large-scale data mining fields. They will help users find useful knowledge in large-scale raw data and will contribute to the development of large-scale data analysis research.

Scale of Use

A few VMs for an experiment.