0. Article Information and Links
- Paper's project website: https://ricsonc.github.io/grnn/
- Semantic Scholar: https://www.semanticscholar.org/paper/Learning-Spatial-Common-Sense-With-Geometry-Aware-Tung-Cheng/524cb4afc83a36a4f126de541ceadda26797ff54
- Release date: 2019/04/09
- Number of citations (as of 2020/08/17): 19
- Talk that discusses this paper from CVPR 2020 3D Scene Understanding Workshop: https://www.youtube.com/watch?v=1d-KsKjWUbo&t=38m26s
1. What do the authors try to accomplish?
- Learn a 3D understanding of a scene: "Create geometrically consistent mapping between world scenes and 3D latent features"
- Predict novel camera views given short frame sequences
- 3D object detection from the 3D latent features alone (no 2D detection); detections persist even when objects are occluded
- Provide an architecture with built-in biases for spatial common sense
The authors seek to answer:
- Do the proposed Geometric RNNs learn spatial common sense?
- Are geometric structural biases necessary for spatial common sense to emerge?
- How well do GRNNs perform on egomotion estimation and 3D object detection?
2. What's great compared to previous research?
- SOTA computer vision relies on internet images taken from static viewpoints; such pictures are already framed by human photographers.
- But real agents that move around see things from many angles. Images may not even be labelable, because what is seen can be half out of view or occluded.
- Propose an architecture, the GRNN, with an inherent 3D structural bias: just as convolutions carry a 2D spatial bias, GRNNs carry a 3D spatial bias.
- GRNN latent space supports scene arithmetic, e.g. add or subtract 3D objects.
- Suggests that a 3D latent space and egomotion stabilization are necessary architectural choices for spatial reasoning to emerge.
3. Where are the key elements of the technology and method?
Methods
Geometric Recurrent Neural Networks
Outperform geometry-unaware nets by lifting 2D image features into a 3D feature grid (see the unprojection sketch below).
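To make the geometric bias concrete, here is a minimal sketch of that unprojection step: project each voxel center through pinhole intrinsics and bilinearly sample the 2D feature map. The function name, the frustum-shaped grid, and the depth range are illustrative assumptions, not the authors' code (the actual model unprojects into a fixed grid in the reference frame).

```python
import torch
import torch.nn.functional as F

def unproject_to_voxels(feat2d, K, depth_range=(0.5, 4.0), grid=32):
    """Lift a 2D feature map (B, C, H, W) into a 3D feature grid (B, C, D, D, D)
    by projecting each voxel center through pinhole intrinsics K (B, 3, 3)
    and bilinearly sampling the 2D features at the resulting pixel.
    Simplified sketch: the voxel centers here span a camera-frustum-shaped grid."""
    B, C, H, W = feat2d.shape
    device = feat2d.device

    # Voxel centers in camera coordinates (X right, Y down, Z forward).
    zs = torch.linspace(depth_range[0], depth_range[1], grid, device=device)
    ys = torch.linspace(-1.0, 1.0, grid, device=device)
    xs = torch.linspace(-1.0, 1.0, grid, device=device)
    z, y, x = torch.meshgrid(zs, ys, xs, indexing="ij")          # each (D, D, D)
    pts = torch.stack([x * z, y * z, z], dim=-1).reshape(-1, 3)  # (N, 3) metric coords

    # Project with intrinsics: u = fx * X / Z + cx, v = fy * Y / Z + cy.
    pix = (K @ pts.T.unsqueeze(0).expand(B, -1, -1)).transpose(1, 2)  # (B, N, 3)
    uv = pix[..., :2] / pix[..., 2:3]

    # Normalize pixel coordinates to [-1, 1] and sample the 2D features.
    uv_norm = torch.stack([2 * uv[..., 0] / (W - 1) - 1,
                           2 * uv[..., 1] / (H - 1) - 1], dim=-1)
    samples = F.grid_sample(feat2d, uv_norm.unsqueeze(1), align_corners=True)  # (B, C, 1, N)
    return samples.reshape(B, C, grid, grid, grid)
```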
Egomotion-Stabilized Convolutions
Proposed in earlier work by Cheng et al. and also used here.
After the 2D features are unprojected into 3D, each view's grid is generally in a different 3D orientation. The network therefore accounts for egomotion (camera motion) by "normalizing" (i.e., stabilizing) every 3D grid into the same orientation before writing it into the GRU memory (see the sketch below).
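A hedged sketch of that stabilization step, assuming rotation-only egomotion and PyTorch's `affine_grid`/`grid_sample` for resampling the feature grid; the actual implementation also handles the translational part of egomotion and has its own resampling details.

```python
import torch
import torch.nn.functional as F

def stabilize(feat3d, R_canon_to_view):
    """Resample a 3D feature grid (B, C, D, H, W) so every view's features land
    in one shared, canonical orientation before the GRU memory update.
    affine_grid maps output (canonical) voxel coordinates through theta into the
    input (view) grid, so the rotation passed in goes canonical -> view frame.
    Rotation-only sketch; the real model also accounts for camera translation."""
    B = feat3d.shape[0]
    zeros = torch.zeros(B, 3, 1, device=feat3d.device, dtype=feat3d.dtype)
    theta = torch.cat([R_canon_to_view, zeros], dim=2)              # (B, 3, 4)
    grid = F.affine_grid(theta, list(feat3d.shape), align_corners=True)
    return F.grid_sample(feat3d, grid, align_corners=True)
```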
Self-supervised Approach
One of the central contributions of this work: the pipeline goes 2D --> 3D --> 2D, so a separate held-out 2D view can serve as ground truth and be compared against what is decoded from the 3D representation (sketched below).
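A rough sketch of the resulting self-supervised objective. It assumes the 3D memory has already been oriented to the query camera (e.g., with the `stabilize()` helper above); the small 2D decoder and the L1 loss are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewPredictionLoss(nn.Module):
    """Hedged sketch of the 2D -> 3D -> 2D self-supervised objective:
    flatten the query-oriented 3D memory along depth, decode an RGB image,
    and compare it with the held-out ground-truth view."""

    def __init__(self, channels=32, depth=32):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv2d(channels * depth, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, oriented3d, target_rgb):
        # Collapse the depth axis into channels, decode, and match the true view.
        B, C, D, H, W = oriented3d.shape
        pred = self.decoder(oriented3d.reshape(B, C * D, H, W))
        pred = F.interpolate(pred, size=target_rgb.shape[-2:],
                             mode="bilinear", align_corners=False)
        return F.l1_loss(pred, target_rgb)
```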
Experiments
Novel View Synthesis
New views given the 3D latent representation.
Object Permanence
Object detection that persists for heavily occluded objects
Scene Arithmetic
Adding and subtracting objects within the 3D latent representation is accurately reflected in the rendered (projected) result; see the illustration below.
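An illustrative snippet of what scene arithmetic amounts to in the latent space: element-wise addition and subtraction of feature grids that share one canonical 3D frame. The tensors below are random stand-ins for real GRNN memories built from posed views.

```python
import torch

# Hedged illustration of scene arithmetic on registered latent grids.
# The mem_* tensors stand in for GRNN memories (B, C, D, H, W); here they are
# random just to show that the composition itself is element-wise.
B, C, D = 1, 32, 32
mem_scene  = torch.randn(B, C, D, D, D)   # scene containing an object
mem_object = torch.randn(B, C, D, D, D)   # the same object in isolation
mem_other  = torch.randn(B, C, D, D, D)   # a different scene

# Because all three grids live in the same canonical 3D frame, subtracting an
# object's map removes it from the scene and adding a map inserts it.
edited = mem_scene - mem_object + mem_other
# A view decoder (e.g., the one sketched above) would then render `edited`
# from any query camera to visualize the edit.
```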
3D Segmentation
Attaches a 3D version of Mask R-CNN to predict 3D meshes, then computes a loss between the generated 3D mesh and the ground-truth 3D mesh.
4. How do the authors measure success?
Qualitative
Compared to the Tower baseline, the GRNN results look visually better.
Quantitative
3D object detection and segmentation, measured with Intersection over Union (IoU); a generic voxel-IoU sketch follows below.
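For reference, a generic voxel-grid IoU, assuming binary occupancy grids for prediction and ground truth; the paper's exact evaluation protocol (box vs. voxel IoU, thresholds) may differ.

```python
import torch

def voxel_iou(pred, target, threshold=0.5):
    """Intersection over Union between two voxel occupancy grids of equal shape.
    `pred` holds predicted occupancy scores and is thresholded; `target` is binary.
    Generic metric sketch, not the authors' exact evaluation code."""
    p = pred > threshold
    t = target.bool()
    intersection = (p & t).sum().float()
    union = (p | t).sum().float()
    return (intersection / union.clamp(min=1)).item()
```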
5. How did you verify that it works?
6. Things to discuss? (e.g. weaknesses, potential for future work, relation to other work)
Limitations: restricted to static scenes, and to a camera orbiting a fixed location (no camera translation is used).
7. Are there any papers to read next?
- Follow-up paper that handles camera translation over video: http://www.cs.cmu.edu/~aharley/viewcontrast/
- Reminds me of the recent Free View Synthesis paper: https://www.youtube.com/watch?v=JDJPn3ZtfZs