Learning to Summarize User-Generated Video

Project Details

Project Lead
Kristen Grauman 
Project Manager
Kristen Grauman 
Project Members
Dinesh Jayaraman, Ruohan Gao  
University of Texas at Austin, Department of Computer Science  
Computer Science (401) 


We are developing automatic video summarization algorithms and supporting recognition methods that require large-scale computation. A video summarization algorithm must answer the question: which frames in the original video are worthy of appearing in the output summary? A rich literature in computer vision and multimedia has produced many methods for deciding which subset of frames (or subshots) should be selected. Despite this progress, we contend that prior approaches are inherently impeded by their unsupervised nature. Specifically, they have three key limitations: (1) they rely on hand-crafted criteria that are believed to lead to good summaries; (2) they summarize by compression using temporally local information, which makes it difficult to optimize the criteria jointly over the selected frames as a set; (3) they often expect preprocessed inputs captured by professional photographers (e.g., newscasts, sports events, movies, TV shows), which leaves them poorly suited to the urgent need to summarize unedited "user-generated" video, whose visual quality differs significantly from that of edited data.

Intellectual Merit

Video summarization has shifted into new territory, owing to the explosion of video data of many different types. From user-generated videos to professionally edited content, the complexity and heterogeneity of this data pose many new challenges. Unsupervised methods, which are the cornerstone of nearly all existing approaches, have become increasingly limiting due to their reliance on hand-crafted heuristics. By posing video summarization as a supervised learning problem, we will investigate a markedly different formulation of the summarization task. Our research stands to contribute to the scientific community in four thrusts: (1) powerful and effective probabilistic models for learning to select the optimal subset of video frames for summarization; (2) semi-supervised learning models and co-summarization algorithms for leveraging the abundance of multiple related videos; (3) algorithms for exploiting photos on the Web to improve summarization; and (4) evaluation protocols that assess summaries in a way that aligns well with human comprehension.
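
As a rough illustration of what a supervised formulation can look like, the sketch below learns a frame-scoring function from human-labeled keyframes and then keeps the top-scoring frames of a new video. It is not the project's probabilistic model: plain logistic regression is swapped in for simplicity, and the two per-frame features, the toy data, and the function names (`train_frame_scorer`, `select_frames`) are all hypothetical. It also scores frames independently, so it does not capture the joint, set-level selection the project targets.

```python
import math

def train_frame_scorer(features, labels, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent.

    features: list of per-frame feature vectors (hypothetical descriptors).
    labels:   1 if annotators kept the frame in their summary, else 0.
    """
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted prob. minus label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def select_frames(features, w, b, k):
    """Return indices of the k highest-scoring frames, in temporal order."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in features]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

# Toy training set (assumed): two made-up features per frame.
train_x = [[0.9, 0.8], [0.1, 0.2], [0.8, 0.9], [0.2, 0.1], [0.7, 0.7], [0.3, 0.2]]
train_y = [1, 0, 1, 0, 1, 0]
w, b = train_frame_scorer(train_x, train_y)

# Summarize an unseen five-frame "video" with a two-frame budget.
new_video = [[0.1, 0.1], [0.9, 0.9], [0.2, 0.3], [0.8, 0.7], [0.15, 0.2]]
selected = select_frames(new_video, w, b, k=2)
print(selected)
```

The contrast with the unsupervised setting is that the selection criterion here comes from the labeled summaries rather than from a hand-crafted heuristic.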

Broader Impacts

Our efforts address the urgent need to develop methods for summarizing user-generated videos (often mobile phone or wearable camera video), which have become the most common video genre on public sharing sites and private exchanges. Our research will yield practical tools for video summarization, which will greatly facilitate video browsing, search, dissemination, and communication. Applications of reliable summarization systems with societal impact are abundant. A primatologist gathering long videos of her animal subjects could quickly browse a week's worth of their activity before deciding where to inspect the data most closely. A young student searching YouTube to learn about Yellowstone National Park could see at a glance what content exists, far better than today's simple thumbnail images can depict. A grandparent could navigate endless videos of the grandchildren, while an intelligence agent could rapidly sift through reams of aerial video. Our research advances the state of the art in machine learning and computer vision and appeals broadly to those communities as well as to related ones such as multimedia processing. Research results, including code and data, will be publicly disseminated. PhD students will receive interdisciplinary training, becoming experienced in both computer vision and machine learning. Both PIs actively participate in their institutions' outreach activities. Given the broad public appeal of research topics in information media, these outreach efforts will allow the PIs to engage younger generations in STEM education and career paths.

Scale of Use

At peak usage, we expect to be continuously using up to 20 GPUs on Delta for 7 to 10 days at a time (over multiple batches of processes). A reasonable estimate of average usage over the next few months is about 5 GPUs for roughly 10 hours each per week.
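
For planning purposes, these figures can be converted into GPU-hours. The short calculation below is only a back-of-the-envelope sketch: it assumes that "continuously" means all 20 GPUs are busy 24 hours a day for the whole peak period, which is an upper bound, and the helper name `gpu_hours` is illustrative.

```python
HOURS_PER_DAY = 24

def gpu_hours(num_gpus, hours_each):
    """Total GPU-hours when each of num_gpus runs for hours_each hours."""
    return num_gpus * hours_each

# Peak: up to 20 GPUs running around the clock for 7 to 10 days (assumed upper bound).
peak_low = gpu_hours(20, 7 * HOURS_PER_DAY)
peak_high = gpu_hours(20, 10 * HOURS_PER_DAY)

# Average: about 5 GPUs for about 10 hours each, per week.
weekly_average = gpu_hours(5, 10)

print(f"Peak burst: {peak_low} to {peak_high} GPU-hours")
print(f"Typical week: {weekly_average} GPU-hours")
```

Under these assumptions, a single peak burst is on the order of 3,360 to 4,800 GPU-hours, compared with about 50 GPU-hours in a typical week.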