Research Collection School Of Computing and Information Systems

Multimodal transformer networks for end-to-end video-grounded dialogue systems

Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

8-2019

Abstract

Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is conducted based on visual and audio aspects of a given video, is significantly more challenging than developing traditional image- or text-grounded dialogue systems because (1) the feature space of videos spans multiple picture frames, making it difficult to obtain semantic information; and (2) a dialogue agent must perceive and process information from different modalities (audio, video, caption, etc.) to obtain a comprehensive understanding. Most existing work is based on RNNs and sequence-to-sequence architectures, which are not very effective for capturing complex long-term dependencies such as those in videos. To overcome this, we propose Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities. We also propose query-aware attention through an auto-encoder to extract query-aware features from non-text modalities. We develop a training procedure that simulates token-level decoding to improve the quality of generated responses during inference. We achieve state-of-the-art performance on the Dialogue System Technology Challenge 7 (DSTC7). Our model also generalizes to another multimodal visual-grounded dialogue task and obtains promising performance.

Discipline

Databases and Information Systems | Graphics and Human Computer Interfaces | OS and Networks

Research Areas

Data Science and Engineering

Publication

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Florence, Italy, 2019 July 28 - August 2

First Page

5612

Last Page

5623

Publisher

ACL

City or Country

Arlington, VA

Citation

LE, Hung; SAHOO, Doyen; CHEN, Nancy F.; and HOI, Steven C. H. Multimodal transformer networks for end-to-end video-grounded dialogue systems. (2019). Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Florence, Italy, 2019 July 28 - August 2. 5612-5623.
Available at: https://ink.library.smu.edu.sg/sis_research/4428

Copyright Owner and License

Publisher

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://www.aclweb.org/anthology/P19-1564/

Included in

Databases and Information Systems Commons, Graphics and Human Computer Interfaces Commons, OS and Networks Commons
