Deep Learning for Scene Understanding (3D, Functional, Semantics)

Project Details

Project Lead
Abhinav Gupta 
Project Manager
Abhinav Gupta 
Project Members
Xiaolong Wang, Abhinav Shrivastava, Naiyan Wang, Xinlei Chen, Rohit Girdhar, Sahil Shah, Ankit Laddha, Senthil Purushwalkam  
Institution
Carnegie Mellon University, Robotics Institute
Discipline
Computer Science (401) 
Subdiscipline
11.04 Information Sciences and Systems 

Abstract

The goal of this project is to develop a geometric and functional representation of our visual world for scene understanding. The project aims to harness recent advances in deep learning, specifically convolutional neural networks (CNNs), and to explore how much they can improve scene understanding. A further goal is to explore how reasoning can be performed using CNNs.

Intellectual Merit

This project is an attempt to use deep-learning-based approaches to integrate physical and visual representations of the visual world with action modeling. We propose research divided into three task areas: Task 1. Physical Representation: Exploring how CNNs can be used to predict surface normals for an input image. Task 2. Functional Representation: Exploring how CNNs can be used for direct perception of affordances. Task 3. Reasoning: Exploring how CNNs can be used for combining multiple tasks.
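To make the output of Task 1 concrete, the following is a minimal sketch, assuming a network emits a raw three-channel vector per pixel. It shows how such an output could be normalized into unit surface normals and compared against a reference normal map by mean angular error. The function names, the (H, W, 3) array layout, and the choice of angular error as the comparison are illustrative assumptions, not the project's actual method or evaluation protocol.

```python
import numpy as np

def to_unit_normals(raw, eps=1e-8):
    """Scale each per-pixel vector in an (H, W, 3) array to unit length.

    `raw` stands in for a hypothetical CNN output; `eps` avoids division by zero.
    """
    norms = np.linalg.norm(raw, axis=-1, keepdims=True)
    return raw / np.maximum(norms, eps)

def mean_angular_error_deg(pred, target):
    """Mean per-pixel angle, in degrees, between two (H, W, 3) normal maps."""
    # Dot product of unit vectors gives the cosine of the angle between them.
    cos = np.sum(to_unit_normals(pred) * to_unit_normals(target), axis=-1)
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(angles.mean())

if __name__ == "__main__":
    # Toy 2x2 example: a flat surface facing the camera (+z direction).
    target = np.zeros((2, 2, 3))
    target[..., 2] = 1.0
    pred = np.zeros((2, 2, 3))
    pred[..., 2] = 2.0  # same direction, different magnitude
    print(mean_angular_error_deg(pred, target))
```

Because both maps are normalized before comparison, only the predicted direction matters, not the magnitude of the raw output.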

Broader Impacts

This project is anticipated to result in major advances within the image understanding community, bringing it closer to researchers in deep learning and robotics. It is anticipated to result in improvements in: (a) 3D Scene Understanding; (b) Recognition; and (c) Human Activity Understanding, and hence could be a critical enabling technology for applications such as autonomous systems, surveillance, and personal robotics. This project is also expected to contribute to education through course development, student projects (see https://sites.google.com/site/16899fall2014/), workshops, and tutorials that reach a broader audience, including through popular online media (e.g., YouTube).

Scale of Use

We will need a few VMs for a few months to run experiments with different architectures. We will also need a few TB of storage.