Learning to Summarize User-Generated Video

Project ID
Project Categories
Computer Science
NSF Grant Number
We are developing automatic video summarization algorithms and supporting recognition methods that require large-scale computation. A video summarization algorithm must answer one question: which frames in the original video are worthy of appearing in the output summary? A rich literature in computer vision and multimedia has produced many methods for prioritizing which subset of frames (or subshots) should be selected. Despite this progress, we contend that prior approaches are inherently impeded by their unsupervised nature. Specifically, they suffer from three key limitations: (1) they rely on hand-crafted criteria that are only believed to lead to good summaries; (2) they summarize by compression using temporally local information, which makes it difficult to optimize the criteria jointly over the selected frames as a set; and (3) they often expect preprocessed inputs captured by professional photographers (e.g., newscasts, sports events, movies, TV shows), which makes them inadequate for the urgent need to summarize unedited “user-generated” video, whose visual quality differs significantly from that of edited data.
Use of FutureSystems
"Deep learning" systems represent the current state of the art in many computer vision and machine learning tasks and are becoming increasingly important in our group's research. Training them requires access to GPU clusters. Such systems also tend to have many hyperparameters that must be tuned through repeated trial and error, varying each hyperparameter across many training runs. Access to the Delta cluster should significantly speed up our research progress.
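To illustrate why hyperparameter tuning multiplies the compute requirement, here is a minimal sketch that enumerates the trials in a small grid search. The hyperparameter names and candidate values are hypothetical placeholders, not the settings used in this project, and grid search is only one possible tuning strategy.

```python
from itertools import product

# Hypothetical search space: names and values are illustrative assumptions,
# not taken from the project.
search_space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [32, 64, 128],
    "dropout": [0.3, 0.5],
}

def enumerate_trials(space):
    """Return one configuration dict per combination of candidate values."""
    names = list(space)
    value_lists = [space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

trials = enumerate_trials(search_space)

# Each trial corresponds to one independent training run, so the number of
# runs is the product of the number of candidates per hyperparameter.
print(f"{len(trials)} training runs needed for this grid")
```

Even this small grid requires 3 x 3 x 2 = 18 separate training runs, and because the runs are independent they can be spread across several GPUs at once.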
Scale of Use
At peak usage, we expect to use up to 20 GPUs on Delta continuously for 7-10 days at a time (over multiple batches of processes). A reasonable estimate of average usage over the next few months is about 5 GPUs for about 10 hours each per week.
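The figures above can be converted into GPU-hours with a short calculation. This is a rough sketch that assumes "continuously" means 24 hours per day on every GPU during a peak period; the function name is illustrative.

```python
HOURS_PER_DAY = 24  # assumption: peak usage runs around the clock

def gpu_hours(num_gpus, hours_per_gpu):
    """Total GPU-hours consumed by num_gpus GPUs running hours_per_gpu each."""
    return num_gpus * hours_per_gpu

# Peak usage: up to 20 GPUs for 7 to 10 days at a time.
peak_low = gpu_hours(20, 7 * HOURS_PER_DAY)
peak_high = gpu_hours(20, 10 * HOURS_PER_DAY)

# Average usage: about 5 GPUs for about 10 hours each, per week.
weekly_average = gpu_hours(5, 10)

print(f"Peak period: {peak_low}-{peak_high} GPU-hours")
print(f"Average week: {weekly_average} GPU-hours")
```

Under these assumptions a single peak period consumes between 3,360 and 4,800 GPU-hours, compared with roughly 50 GPU-hours in an average week.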