Keyword-based Semantic Association Discovery

Project Details

Project Lead
Mo Zhou 
Project Manager
Mo Zhou 
Project Members
Yifan Pan  
Indiana University, Computer Science  
Computer Science (401) 


In many domains, such as social networks, cheminformatics, bioinformatics, and health informatics, data can be represented naturally in a graph model, with nodes as data entries and edges as the relationships between them. The graph nature of these data brings both opportunities and challenges to data storage and retrieval. In particular, it opens the door to search problems such as semantic association discovery [13, 14, 15] and semantic search [2, 10, 11]. Our study of the application requirements in these domains shows that discovering Constraint Acyclic Paths (CAP) is in high demand. In this paper, we define the CAP problem and propose a set of quantitative metrics for describing keyword-based constraints. We introduce cSPARQL to integrate CAP queries into SPARQL. We propose a series of algorithms to efficiently evaluate core CAP, a critical fragment of CAP queries, on large-scale graph data. To further enhance the scalability of query evaluation, we propose to parallelize the algorithm using MapReduce on Hadoop. Extensive experiments illustrate that our algorithms are efficient in answering CAP queries. Applying our technologies to scientific domains has drawn interest from domain experts.

Intellectual Merit

We propose a new keyword-based search problem built on two metrics, coverage and relevance: coverage is the fraction of the keywords covered in a path, and relevance is the fraction of labels in a path that appear in the keyword set. We further propose three families of algorithms to efficiently evaluate the search queries, where the last family takes a parallel approach using MapReduce.
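The two metrics can be illustrated with a minimal Python sketch. It assumes a path is given as a list of node and edge labels and that labels are counted as distinct values; the function names, the label vocabulary, and the set-based counting are illustrative assumptions, not the project's actual implementation.

```python
def coverage(path_labels, keywords):
    """Fraction of the query keywords that occur as labels in the path."""
    keyword_set = set(keywords)
    if not keyword_set:
        return 0.0
    return len(keyword_set & set(path_labels)) / len(keyword_set)


def relevance(path_labels, keywords):
    """Fraction of the path's distinct labels that occur in the keyword set."""
    label_set = set(path_labels)  # assumption: duplicate labels count once
    if not label_set:
        return 0.0
    return len(label_set & set(keywords)) / len(label_set)


# Hypothetical path and query, for illustration only.
path = ["gene", "encodes", "protein", "targets", "drug"]
keywords = ["gene", "drug", "disease"]

print(coverage(path, keywords))   # 2 of 3 keywords are covered
print(relevance(path, keywords))  # 2 of 5 path labels are keywords
```

Under these assumptions, a path scores high on coverage when it touches most of the keywords and high on relevance when it contains few labels outside the keyword set, so the two metrics pull in different directions for long paths.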

Broader Impacts

Our work is based on a study of the requirements of bioinformatics, cheminformatics, social networks, and federal security. The new type of search we propose can greatly facilitate many applications in those domains, such as drug discovery and security checks. It helps find complex, meaningful relationships that are otherwise very difficult to identify in very large datasets.

Scale of Use

We may use at most 16 nodes at a time. Each task will finish in less than an hour, but we need to run intensive tests across many tasks. We may therefore request the resources for three or four days.