Publication Type
Journal Article
Version
acceptedVersion
Publication Date
9-2020
Abstract
This work extends our participation in the Dialogue System Technology Challenge (DSTC7), where we took part in the Audio Visual Scene-aware Dialogue (AVSD) track. The AVSD track evaluates how well dialogue systems understand video scenes and respond to users about the video's visual and audio content. We propose a hierarchical attention approach over user queries, video captions, and audio and visual features, which contributes to improved evaluation results. We also apply a nonlinear feature fusion approach to combine the visual and audio features for better knowledge representation. Our proposed model shows superior performance in terms of both objective evaluation and human ratings compared to the baselines. In this extended work, we also provide a more extensive review of related work, conduct additional experiments with word-level and context-level pretrained embeddings, and investigate different qualitative aspects of the generated responses.
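The abstract names two mechanisms: query-guided attention applied within each modality and then across modality summaries ("hierarchical"), and a nonlinear fusion of the visual and audio features. The sketch below is a minimal, hypothetical PyTorch illustration of these two ideas only; the module names, dimensions, and the gated-tanh fusion choice are assumptions, not the authors' implementation.

    # Hypothetical sketch: two-level (hierarchical) multimodal attention plus
    # a nonlinear (gated-tanh) visual/audio fusion. All names and sizes are
    # illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HierarchicalMultimodalAttention(nn.Module):
        def __init__(self, d_model=256):
            super().__init__()
            # Level 1: per-modality attention scorers, conditioned on the query.
            self.score = nn.ModuleDict({
                m: nn.Linear(2 * d_model, 1) for m in ("caption", "visual", "audio")
            })
            # Level 2: attention over the attended modality summaries.
            self.modality_score = nn.Linear(2 * d_model, 1)
            # Nonlinear fusion of visual and audio summaries (gated tanh).
            self.fuse = nn.Linear(2 * d_model, d_model)
            self.gate = nn.Linear(2 * d_model, d_model)

        def attend(self, name, query, feats):
            # feats: (batch, seq, d); query: (batch, d) -> summary (batch, d)
            q = query.unsqueeze(1).expand(-1, feats.size(1), -1)
            w = F.softmax(self.score[name](torch.cat([q, feats], dim=-1)), dim=1)
            return (w * feats).sum(dim=1)

        def forward(self, query, caption, visual, audio):
            # Level 1: attend within each modality, guided by the user query.
            c = self.attend("caption", query, caption)
            v = self.attend("visual", query, visual)
            a = self.attend("audio", query, audio)
            # Nonlinear fusion of the visual and audio summaries.
            va = torch.cat([v, a], dim=-1)
            fused = torch.tanh(self.fuse(va)) * torch.sigmoid(self.gate(va))
            # Level 2: attend over modality summaries (caption vs. fused A/V).
            mods = torch.stack([c, fused], dim=1)              # (batch, 2, d)
            q = query.unsqueeze(1).expand(-1, 2, -1)
            w = F.softmax(self.modality_score(torch.cat([q, mods], dim=-1)), dim=1)
            return (w * mods).sum(dim=1)                       # context vector

    # Toy usage with random features: query (4, 256), caption (4, 12, 256),
    # visual and audio (4, 20, 256) -> context vector of shape (4, 256).
    model = HierarchicalMultimodalAttention(d_model=256)
    ctx = model(torch.randn(4, 256), torch.randn(4, 12, 256),
                torch.randn(4, 20, 256), torch.randn(4, 20, 256))
    print(ctx.shape)  # torch.Size([4, 256])

The second attention level lets the decoder weight caption evidence against the fused audio-visual evidence per query, which is one plausible reading of "hierarchical" here; the paper itself should be consulted for the exact architecture.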
Keywords
Audio-visual scene-aware dialogue, Dialogue system, Multimodal attention, Neural network, Response generation
Discipline
Databases and Information Systems
Research Areas
Data Science and Engineering
Publication
Computer Speech and Language
Volume
63
First Page
1
Last Page
13
ISSN
0885-2308
Identifier
10.1016/j.csl.2020.101095
Citation
LE, Hung; SAHOO, Doyen; CHEN, Nancy F.; and HOI, Steven C. H.
Hierarchical multimodal attention for end-to-end audio-visual scene-aware dialogue response generation. (2020). Computer Speech and Language, 63, 1-13.
Available at: https://ink.library.smu.edu.sg/sis_research/5259
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1016/j.csl.2020.101095