Learning mid-level image representations with little or no supervision

Project Details

Project Lead
Alexei Efros 
Project Manager
Tinghui Zhou 
Project Members
Shiry Ginosar, Stefan Lee, Tinghui Zhou, Dinesh Jayaraman, Abhinav Shrivastava  
UC Berkeley, EECS Dept  
Computer Science (401) 
11.01 Computer and Information Sciences, General 


A good image representation is essential for any high-level computer vision task, including object recognition, image retrieval, and scene classification. Building an expressive, multi-purpose image representation has, however, proved extremely challenging. Low-level representations (e.g., edge maps, histograms of gradients) are generally too simple to allow reliable inferences about images. High-level representations (e.g., object detectors, scene classifiers, deep neural networks), on the other hand, classically require huge amounts of hand-labeled data. Hence, our project focuses on “mid-level” representations, which sit between these two extremes and capture moderately complicated, informative patterns in images.

We call the basic units of our representation “mid-level visual elements.” Visual elements may be objects, object parts, or other visual patterns, but they must fit two criteria: (1) they occur frequently in the visual world, and (2) they are informative, in the sense that they tell us something about the image. Empirically, we have found that optimizing elements for these two criteria leads to an intuitive and powerful representation, and furthermore, that the training can be done using only inexpensive labels. For instance, in [1] we used geotagged images (collected automatically from Google Street View) to optimize elements to be specific to particular geographic locations, e.g., the city of Paris. Surprisingly, the algorithm learned to detect many of the distinctive stylistic details of Paris, which are not only geographically informative, but also let us reason about the positions of facades within the images and subdivide the city into different architectural styles. In [2,3], we applied similar mid-level representations to indoor scene classification, and in [4] we inferred the stylistic changes of cars over several decades.

[1] “What makes Paris look like Paris”, Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei A. Efros, In SIGGRAPH 2012.
[2] “Unsupervised discovery of mid-level discriminative patches”, Saurabh Singh, Abhinav Gupta, and Alexei A. Efros, In ECCV 2012.
[3] “Mid-level visual element discovery as discriminative mode seeking”, Carl Doersch, Abhinav Gupta, and Alexei A. Efros, In NIPS 2013.
[4] “Style-aware mid-level representation for discovering visual connections in space and time”, Yong Jae Lee, Alexei A. Efros, and Martial Hebert, In ICCV 2013.
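
The two criteria for visual elements (frequent and informative) can be illustrated with a toy sketch. It is not the project's pipeline: it assumes each image patch has already been reduced to a small numeric descriptor, and it uses a simple nearest-neighbor "purity" score as a stand-in for the actual optimization. The function name `rank_candidates`, the candidate names, and the 2D descriptors are all hypothetical. A candidate whose nearest neighbors come mostly from the weakly labeled positive set (e.g., patches from Paris) has many similar positive patches and few similar negative ones, so it scores highly.

```python
import math


def rank_candidates(candidates, positives, negatives, k=3):
    """Rank candidate patch descriptors by the share of their k nearest
    neighbors that come from the positive (weakly labeled) set.

    Illustrative stand-in only; names and scoring rule are assumptions.
    """
    # Pool all patches, remembering which set each one came from.
    pool = [(vec, True) for vec in positives] + [(vec, False) for vec in negatives]
    ranked = []
    for name, cand in candidates.items():
        # Keep the k pool entries closest to the candidate (Euclidean distance).
        neighbors = sorted(pool, key=lambda item: math.dist(cand, item[0]))[:k]
        purity = sum(1 for _, is_pos in neighbors if is_pos) / k
        ranked.append((name, purity))
    # Highest purity first.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Hypothetical 2D descriptors: positives from the target location,
# negatives from everywhere else.
positives = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
negatives = [(5.1, 5.0), (5.0, 5.1), (9.0, 9.0)]
candidates = {"balcony": (0.05, 0.05), "sidewalk": (5.0, 5.05)}

for name, purity in rank_candidates(candidates, positives, negatives):
    print(f"{name}: purity = {purity:.2f}")
```

In this toy data, the hypothetical "balcony" candidate sits in a cluster made up only of positive patches, while "sidewalk" has close neighbors in both sets, so the former ranks first. Only the weak positive/negative label is used; no patch-level annotation is required.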

Intellectual Merit

While our initial experiments have already shown state-of-the-art results, our prototype pipelines were designed to handle only up to 100,000 images, whereas modern datasets such as ImageNet contain millions. We are currently investigating approximations that would allow us to scale to datasets of this size. We also hope to explore further applications of this technology, including 3D reconstruction of objects using point correspondences derived from visual elements.

Broader Impacts

Our long-term goal for this project is to create an image representation that improves state-of-the-art performance across a wide range of computer vision tasks, including image classification, object localization, dataset visualization, and scene understanding. Beyond computer vision, we are investigating a number of applications, including large-scale analysis of painting styles, visual data mining of historical records, and the study of changes in the appearance of faces in populations over the last 100 years.

Scale of Use

We would like to request 3 to 4 GPU-equipped nodes. Our storage requirements will be moderate, on the order of a few terabytes.