Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

9-2018

Abstract

In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse high-level textual and visual features from the neural network and abandon the visual spatial information when learning multi-modal features. To address these problems, question-guided kernels generated from the input question are designed to convolve with the visual features and capture the textual-visual relationship at an early stage. The question-guided convolution tightly couples the textual and visual information, but it also introduces more parameters when learning the kernels. We apply group convolution, which consists of question-independent kernels and question-dependent kernels, to reduce the parameter size and alleviate over-fitting. The hybrid convolution can generate discriminative multi-modal features with fewer parameters. The proposed approach is also complementary to existing bilinear pooling fusion and attention-based VQA methods; by integrating with them, our method could further boost the performance. Extensive experiments on public VQA datasets validate the effectiveness of QGHC.
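
The following is a minimal, illustrative sketch of the idea described above: a group convolution over visual features in which some groups use ordinary learned (question-independent) kernels and the remaining groups use kernels predicted from the question embedding. All names, shapes, and the 50/50 split between static and dynamic groups are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QGHCConvSketch(nn.Module):
    """Sketch of a question-guided hybrid group convolution (hypothetical)."""

    def __init__(self, in_ch=256, out_ch=256, q_dim=1024, groups=8, k=3):
        super().__init__()
        assert in_ch % groups == 0 and out_ch % groups == 0
        self.groups, self.k = groups, k
        self.in_g, self.out_g = in_ch // groups, out_ch // groups
        # Question-independent kernels: learned directly (half of the groups here).
        self.static_groups = groups // 2
        self.dynamic_groups = groups - self.static_groups
        self.static_conv = nn.Conv2d(
            self.static_groups * self.in_g,
            self.static_groups * self.out_g,
            kernel_size=k, padding=k // 2, groups=self.static_groups)
        # Question-dependent kernels: predicted per sample from the question embedding.
        n_dyn_params = self.dynamic_groups * self.out_g * self.in_g * k * k
        self.kernel_predictor = nn.Linear(q_dim, n_dyn_params)

    def forward(self, v, q):
        # v: (B, in_ch, H, W) visual features; q: (B, q_dim) question embedding.
        B, _, H, W = v.shape
        split = self.static_groups * self.in_g
        v_static, v_dyn = v[:, :split], v[:, split:]

        # Question-independent branch: an ordinary group convolution.
        out_static = self.static_conv(v_static)

        # Question-dependent branch: per-sample predicted kernels, applied via
        # a grouped convolution over the batch folded into the channel axis.
        w = self.kernel_predictor(q).view(
            B * self.dynamic_groups * self.out_g, self.in_g, self.k, self.k)
        v_dyn = v_dyn.reshape(1, B * self.dynamic_groups * self.in_g, H, W)
        out_dyn = F.conv2d(v_dyn, w, padding=self.k // 2,
                           groups=B * self.dynamic_groups)
        out_dyn = out_dyn.view(B, self.dynamic_groups * self.out_g, H, W)

        # Concatenate question-independent and question-dependent outputs.
        return torch.cat([out_static, out_dyn], dim=1)

# Example usage with made-up sizes:
# layer = QGHCConvSketch()
# out = layer(torch.randn(2, 256, 14, 14), torch.randn(2, 1024))  # (2, 256, 14, 14)
```

Predicting kernels for only a subset of groups keeps the linear predictor small (its output dimension grows with the number of dynamic kernel parameters), which is the parameter-reduction and over-fitting argument made in the abstract.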

Keywords

VQA, Dynamic Parameter Prediction, Group Convolution

Discipline

Databases and Information Systems | Theory and Algorithms

Research Areas

Data Science and Engineering

Publication

Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings

Volume

11205

First Page

485

Last Page

501

ISBN

9783030012465

Identifier

10.1007/978-3-030-01246-5_29

Publisher

Springer

City or Country

Cham

Copyright Owner and License

Authors

Additional URL

https://doi.org/10.1007/978-3-030-01246-5_29
