Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
7-2020
Abstract
Pre-trained language models have shown remarkable success in improving various downstream NLP tasks due to their ability to capture dependencies in textual data and generate natural responses. In this paper, we leverage pre-trained language models to improve video-grounded dialogue, a challenging task that involves complex features of different dynamics: (1) video features, which can extend across both spatial and temporal dimensions; and (2) dialogue features, which involve semantic dependencies over multiple dialogue turns. We propose a framework that extends GPT-2 models to tackle these challenges by formulating video-grounded dialogue as a sequence-to-sequence task, combining both visual and textual representations into a structured sequence, and fine-tuning a large pre-trained GPT-2 network. Our framework allows fine-tuning language models to capture dependencies across multiple modalities over different levels of information: the spatio-temporal level in video and the token-sentence level in dialogue context. We achieve promising improvements on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark from DSTC7, which supports a potential direction in this line of research.
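The idea of combining visual and textual representations into one structured sequence can be illustrated with a small sketch. This is not the authors' implementation: the `Element` type, the `build_structured_sequence` function, the frame-major ordering, and the string placeholders standing in for feature vectors are all assumptions made for illustration. It only shows how a grid of video positions (temporal by spatial) and a multi-turn dialogue (turn by token) might be flattened into a single sequence while keeping both levels of position information.

```python
from typing import List, NamedTuple


class Element(NamedTuple):
    """One position in the combined sequence (hypothetical layout)."""
    modality: str  # "video" or "text"
    coarse: int    # frame index for video, turn index for text
    fine: int      # region index for video, token index within a turn for text
    payload: str   # placeholder for a feature vector or a token


def build_structured_sequence(
    num_frames: int, num_regions: int, turns: List[str]
) -> List[Element]:
    """Flatten video positions, then append dialogue tokens turn by turn."""
    sequence: List[Element] = []
    # Video part: one element per (frame, region) pair, frame-major order.
    for t in range(num_frames):
        for r in range(num_regions):
            sequence.append(Element("video", t, r, f"feat[{t},{r}]"))
    # Text part: whitespace tokenization stands in for a real tokenizer.
    for i, turn in enumerate(turns):
        for j, token in enumerate(turn.split()):
            sequence.append(Element("text", i, j, token))
    return sequence


sequence = build_structured_sequence(
    num_frames=2,
    num_regions=3,
    turns=["what is he doing", "he is cooking"],
)
for element in sequence:
    print(element)
```

In a real model, each `payload` would be an embedding, and the `modality`, `coarse`, and `fine` fields would typically be turned into additional learned encodings added to it before the sequence is passed to the language model.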
Discipline
Artificial Intelligence and Robotics | Programming Languages and Compilers
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual Conference, 2020 July 5-10
First Page
5842
Last Page
5848
Identifier
10.18653/v1/2020.acl-main.518
Publisher
ACL
City or Country
Virtual Conference
Citation
LE, Hung and HOI, Steven C. H.
Video-grounded dialogues with pretrained generation language models. (2020). Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual Conference, 2020 July 5-10. 5842-5848.
Available at: https://ink.library.smu.edu.sg/sis_research/10169
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
http://doi.org/10.18653/v1/2020.acl-main.518
Included in
Artificial Intelligence and Robotics Commons, Programming Languages and Compilers Commons