Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

11-2020

Abstract

Video-grounded dialogues are very challenging due to (i) the complexity of videos which contain both spatial and temporal variations, and (ii) the complexity of user utterances which query different segments and/or different objects in videos over multiple dialogue turns. However, existing approaches to video-grounded dialogues often focus on superficial temporal-level visual cues, but neglect more fine-grained spatial signals from videos. To address this drawback, we propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos based on textual cues. Specifically, our approach not only exploits both spatial and temporal-level information, but also learns dynamic information diffusion between the two feature spaces through spatial-to-temporal and temporal-to-spatial reasoning. The bidirectional strategy aims to tackle the evolving semantics of user queries in the dialogue setting. The retrieved visual cues are used as contextual information to construct relevant responses to the users. Our empirical results and comprehensive qualitative analysis show that BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark. We also adapt our BiST models to the Video QA setting, and substantially outperform prior approaches on the TGIF-QA benchmark.
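
The abstract describes the core mechanism at a high level: a text-derived query drives attention over video features in two directions, spatial-to-temporal and temporal-to-spatial, and the two resulting summaries serve as visual context for response generation. The sketch below is a rough illustration of that idea only, not the authors' released BiST implementation; the tensor shapes, the single-vector query, and the helper names are assumptions made for this example.

```python
# Minimal illustrative sketch of bi-directional spatio-temporal attention.
# NOT the authors' BiST code; shapes and the single-vector query are assumptions.
import torch
import torch.nn.functional as F


def attend(query, keys):
    """Scaled dot-product attention: pool keys (N, d) into one vector using query (d,)."""
    scores = keys @ query / keys.shape[-1] ** 0.5   # (N,)
    weights = F.softmax(scores, dim=-1)             # (N,)
    return weights @ keys                           # (d,)


def bidirectional_st_attention(video, query):
    """
    video: (T, S, d) features -- T temporal steps, S spatial regions, d channels.
    query: (d,) text-derived query vector.
    Returns a (2*d,) visual summary combining both reasoning directions.
    """
    T, S, d = video.shape

    # Spatial-to-temporal: pool spatial regions within each frame first,
    # then attend over the resulting T frame-level vectors.
    frame_vecs = torch.stack([attend(query, video[t]) for t in range(T)])      # (T, d)
    s2t = attend(query, frame_vecs)                                            # (d,)

    # Temporal-to-spatial: pool the temporal axis for each region first,
    # then attend over the resulting S region-level vectors.
    region_vecs = torch.stack([attend(query, video[:, s]) for s in range(S)])  # (S, d)
    t2s = attend(query, region_vecs)                                           # (d,)

    # Concatenate the two directional summaries as context for the decoder.
    return torch.cat([s2t, t2s], dim=-1)


if __name__ == "__main__":
    video = torch.randn(10, 16, 128)   # 10 frames, 16 regions, 128-d features
    query = torch.randn(128)
    print(bidirectional_st_attention(video, query).shape)  # torch.Size([256])
```

Concatenating the two directional summaries reflects the intuition stated in the abstract: some queries are better resolved by first localizing the relevant frames, others by first localizing the relevant regions.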

Discipline

Artificial Intelligence and Robotics | Programming Languages and Compilers

Research Areas

Intelligent Systems and Optimization

Areas of Excellence

Digital transformation

Publication

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Virtual Conference, November 16-20, 2020

First Page

1846

Last Page

1859

Identifier

10.18653/v1/2020.emnlp-main.145

Publisher

ACL

City or Country

Virtual Conference

Additional URL

http://doi.org/10.18653/v1/2020.emnlp-main.145
