Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

3-2022

Abstract

We focus on the confounding bias between language and location in the visual grounding pipeline, and find that this bias is the major bottleneck for visual reasoning. For example, the grounding process is usually a trivial language-location association without visual reasoning, e.g., grounding any language query containing sheep to regions near the image center, because most queries about sheep have ground-truth locations at the image center. First, we frame the visual grounding pipeline as a causal graph, which shows the causal relations among the image, query, target location, and underlying confounder. The causal graph shows how to break the grounding bottleneck: deconfounded visual grounding. Second, to tackle the challenge that the confounder is generally unobserved, we propose a confounder-agnostic approach, the Referring Expression Deconfounder (RED), to remove the confounding bias. Third, we implement RED as a simple language attention, which can be applied in any grounding method.

Keywords

Computer Vision (CV)

Discipline

Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces

Research Areas

Intelligent Systems and Optimization

Publication

Proceedings of the 36th AAAI Conference on Artificial Intelligence, Virtual Conference, 2022 February 22 - March 1

First Page

998

Last Page

1006

Identifier

10.1609/aaai.v36i1.19983

Publisher

AAAI

City or Country

Virtual Conference

Additional URL

http://doi.org/10.1609/aaai.v36i1.19983
