Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

6-2019

Abstract

Learning effective fusion of multi-modality features is at the heart of visual question answering. We propose a novel method of dynamically fusing multi-modal features with intra- and inter-modality information flow, which alternately passes dynamic information between and across the visual and language modalities. It can robustly capture high-level interactions between the language and vision domains and thus significantly improves the performance of visual question answering. We also show that the proposed dynamic intra-modality attention flow, conditioned on the other modality, can dynamically modulate the intra-modality attention of the target modality, which is vital for multi-modality feature fusion. Experimental evaluations on the VQA 2.0 dataset show that the proposed method achieves state-of-the-art VQA performance. Extensive ablation studies are carried out to provide a comprehensive analysis of the proposed method.
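
The sketch below illustrates the general idea described in the abstract: inter-modality (cross) attention alternating with intra-modality (self) attention whose weights are dynamically modulated by a summary of the other modality. Module names, feature dimensions, and the gating scheme are illustrative assumptions for exposition, not the authors' exact architecture from the paper.

```python
# Hypothetical sketch of alternating inter- and intra-modality attention flow.
# Not the authors' implementation; dimensions and gating are assumed for illustration.
import torch
import torch.nn as nn


class InterModalityAttention(nn.Module):
    """Cross-attention: features of one modality attend over the other modality."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x, y):
        # x: (B, N, D) attends over y: (B, M, D)
        attn = torch.softmax(self.q(x) @ self.k(y).transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        return x + attn @ self.v(y)


class DynamicIntraModalityAttention(nn.Module):
    """Self-attention whose queries/keys are gated by a pooled summary of the other modality."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x, other):
        # Conditioning signal: sigmoid gate from the other modality's mean-pooled features.
        g = torch.sigmoid(self.gate(other.mean(dim=1, keepdim=True)))
        q, k = self.q(x) * g, self.k(x) * g  # dynamically modulated self-attention
        attn = torch.softmax(q @ k.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        return x + attn @ self.v(x)


class FusionBlock(nn.Module):
    """One block alternating inter-modality and dynamic intra-modality information flow."""
    def __init__(self, dim):
        super().__init__()
        self.inter_v, self.inter_q = InterModalityAttention(dim), InterModalityAttention(dim)
        self.intra_v, self.intra_q = DynamicIntraModalityAttention(dim), DynamicIntraModalityAttention(dim)

    def forward(self, vis, lang):
        vis, lang = self.inter_v(vis, lang), self.inter_q(lang, vis)
        vis, lang = self.intra_v(vis, lang), self.intra_q(lang, vis)
        return vis, lang


# Example usage: fuse 36 region features with 14 word features, both 512-d (assumed sizes).
block = FusionBlock(512)
v, q = torch.randn(2, 36, 512), torch.randn(2, 14, 512)
v, q = block(v, q)
```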

Keywords

Vision + Language, Vision Applications and Systems, Visual Reasoning

Discipline

Databases and Information Systems

Research Areas

Data Science and Engineering

Publication

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): Long Beach, CA, June 15-20: Proceedings

First Page

6632

Last Page

6641

ISBN

9781728132938

Identifier

10.1109/CVPR.2019.00680

Publisher

IEEE Computer Society

City or Country

Los Alamitos, CA

Additional URL

https://doi.org/10.1109/CVPR.2019.00680
