Research Collection School Of Computing and Information Systems

Visual Commonsense R-CNN

Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

6-2020

Abstract

We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., using Faster R-CNN), the proxy training objective of VC R-CNN, like that of other unsupervised feature learning methods (e.g., word2vec), is to predict the contextual objects of a region. However, the two are fundamentally different: VC R-CNN makes its prediction using causal intervention, P(Y|do(X)), while the others use the conventional likelihood, P(Y|X). This is also the core reason why VC R-CNN can learn "sense-making" knowledge, such as that a chair can be sat on, rather than just "common" co-occurrences, such as that a chair is likely to exist if a table is observed. We extensively apply VC R-CNN features in prevailing models of three popular tasks, Image Captioning, VQA, and VCR, and observe consistent performance boosts across them, achieving many new state-of-the-art results.
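
The contrast between P(Y|X) and P(Y|do(X)) drawn in the abstract can be illustrated with a small numerical sketch. The abstract does not say how the intervention is computed, so the sketch assumes the standard backdoor adjustment over a single observed binary confounder Z. All variable names and probability values are invented for illustration and are not taken from the paper.

```python
# Toy distribution over binary Z (confounder), X (region), Y (context object).
# All numbers are hypothetical; they only illustrate the two formulas.
P_Z = {0: 0.5, 1: 0.5}                      # P(Z=z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}             # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {                           # P(Y=1 | X=x, Z=z)
    (0, 0): 0.2, (0, 1): 0.6,
    (1, 0): 0.3, (1, 1): 0.7,
}


def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary X."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1


def likelihood(x):
    """Conventional likelihood: P(Y=1 | X=x) = sum_z P(Y=1|x,z) P(z|x)."""
    p_x = sum(p_x_given_z(x, z) * P_Z[z] for z in P_Z)
    return sum(
        P_Y1_GIVEN_XZ[(x, z)] * p_x_given_z(x, z) * P_Z[z] / p_x
        for z in P_Z
    )


def intervention(x):
    """Backdoor adjustment (assumed): P(Y=1 | do(X=x)) = sum_z P(Y=1|x,z) P(z)."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


if __name__ == "__main__":
    print("P(Y=1 | X=1)     =", round(likelihood(1), 4))
    print("P(Y=1 | do(X=1)) =", round(intervention(1), 4))
```

In this toy setting the likelihood weights each value of Z by P(z|x), so it absorbs the correlation between X and Z, whereas the intervention weights each value by the prior P(z). The two quantities differ whenever Z influences both X and Y.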

Discipline

Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces

Research Areas

Data Science and Engineering

Publication

Proceedings of the 33rd Conference on Computer Vision and Pattern Recognition, CVPR '20

Identifier

10.1109/CVPR42600.2020.01077

Publisher

IEEE

City or Country

Washington, United States

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1109/CVPR42600.2020.01077

Included in

Artificial Intelligence and Robotics Commons, Graphics and Human Computer Interfaces Commons
