Use of ILP for cancer driver prioritization

Project Details

Project Lead
Ermin Hodzic 
Project Manager
Ermin Hodzic 
Project Members
Yen-Yi Lin, Ibrahim Numanagic, Alex Gawronski  
Supporting Experts
Indiana University, Computer Science Department  
Computer Science (401) 


A key challenge in cancer genomics is the identification and prioritization of genomic aberrations that potentially act as drivers of cancer. We introduce a combinatorial method to identify aberrant genes that can collectively influence possibly distant "outlier" genes based on what we call the "random-walk facility location" (RWFL) problem on an interaction network. RWFL differs from the standard facility location problem by its use of "multi-hitting time", the minimum expected number of hops in a random walk originating from any aberrant gene to reach an outlier. WE thus aim to find the smallest set of aberrant genes from which one can reach outliers within a desired multi-hitting time. For that it estimates multi-hitting time based on the independent hitting times from the drivers to any given outlier and reduces the RWFL to a weighted multi-set cover problem, which it solves as an integer linear program (ILP).

Intellectual Merit

We present a novel integrative method that considers potential driver events at the genomic level, i.e. single nucleotide mutations, structural or copy number changes. We try to identify “the most parsimonious” set of patient specific driver genes which have sucient “influence” over a large proportion of outlier genes. We formulate this as a “random-walk facility location” problem (RWFL), a combinatorial optimization problem, which, to the best of our knowledge, has not been explored earlier. Since RWFL problem is NP-hard, we estimate the multi-hitting time based on the independent hitting times of the drivers to an outlier, which provides an upper bound on the multi-hitting time. Our experiments show that this estimate works well for the human protein interaction network. More importantly, our estimate enables us to reduce the RWFL problem to a weighted multi-set cover problem, for which we give an ILP formulation.

Broader Impacts

Introducing the multi-set cover method into prioritization of cancer drivers, providing bounds and ways to approximate multi-hitting time in networks and describing validation using module detection for purpose of classification.

Scale of Use

Will be running cplex/gurobi programs on datasets. Each run might take more than a day to solve.