Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

11-2020

Abstract

Video-grounded dialogues are very challenging due to (i) the complexity of videos which contain both spatial and temporal variations, and (ii) the complexity of user utterances which query different segments and/or different objects in videos over multiple dialogue turns. However, existing approaches to video-grounded dialogues often focus on superficial temporal-level visual cues, but neglect more fine-grained spatial signals from videos. To address this drawback, we propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos based on textual cues. Specifically, our approach not only exploits both spatial and temporal-level information, but also learns dynamic information diffusion between the two feature spaces through spatial-to-temporal and temporal-to-spatial reasoning. The bidirectional strategy aims to tackle the evolving semantics of user queries in the dialogue setting. The retrieved visual cues are used as contextual information to construct relevant responses to the users. Our empirical results and comprehensive qualitative analysis show that BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark. We also adapt our BiST models to the Video QA setting, and substantially outperform prior approaches on the TGIF-QA benchmark.
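
The abstract describes the core mechanism at a high level: a text-derived query drives attention over video features in two directions, spatial-to-temporal and temporal-to-spatial, and the two resulting summaries serve as visual context for response generation. The sketch below is a rough illustration of that idea only, not the authors' released BiST implementation; the tensor shapes, the single-vector query, and the helper names are assumptions made for this example.

```python
# Minimal illustrative sketch of bi-directional spatio-temporal attention.
# NOT the authors' BiST code; shapes and the single-vector query are assumptions.
import torch
import torch.nn.functional as F


def attend(query, keys):
    """Scaled dot-product attention: pool keys (N, d) into one vector using query (d,)."""
    scores = keys @ query / keys.shape[-1] ** 0.5   # (N,)
    weights = F.softmax(scores, dim=-1)             # (N,)
    return weights @ keys                           # (d,)


def bidirectional_st_attention(video, query):
    """
    video: (T, S, d) features -- T temporal steps, S spatial regions, d channels.
    query: (d,) text-derived query vector.
    Returns a (2*d,) visual summary combining both reasoning directions.
    """
    T, S, d = video.shape

    # Spatial-to-temporal: pool spatial regions within each frame first,
    # then attend over the resulting T frame-level vectors.
    frame_vecs = torch.stack([attend(query, video[t]) for t in range(T)])      # (T, d)
    s2t = attend(query, frame_vecs)                                            # (d,)

    # Temporal-to-spatial: pool the temporal axis for each region first,
    # then attend over the resulting S region-level vectors.
    region_vecs = torch.stack([attend(query, video[:, s]) for s in range(S)])  # (S, d)
    t2s = attend(query, region_vecs)                                           # (d,)

    # Concatenate the two directional summaries as context for the decoder.
    return torch.cat([s2t, t2s], dim=-1)


if __name__ == "__main__":
    video = torch.randn(10, 16, 128)   # 10 frames, 16 regions, 128-d features
    query = torch.randn(128)
    print(bidirectional_st_attention(video, query).shape)  # torch.Size([256])
```

Concatenating the two directional summaries reflects the intuition stated in the abstract: some queries are better resolved by first localizing the relevant frames, others by first localizing the relevant regions.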

Discipline

Artificial Intelligence and Robotics | Programming Languages and Compilers

Research Areas

Intelligent Systems and Optimization

Areas of Excellence

Digital transformation

Publication

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Virtual Conference, November 16-20, 2020

First Page

1846

Last Page

1859

Identifier

10.18653/v1/2020.emnlp-main.145

Publisher

ACL

City or Country

Virtual Conference

Additional URL

http://doi.org/10.18653/v1/2020.emnlp-main.145
