Publication Type
Journal Article
Version
acceptedVersion
Publication Date
9-2020
Abstract
This work extends our participation in the Dialogue System Technology Challenge (DSTC7), where we took part in the Audio Visual Scene-aware Dialogue (AVSD) track. The AVSD track evaluates how well dialogue systems understand video scenes and respond to users about the video's visual and audio content. We propose a hierarchical attention approach over user queries, video captions, and audio and visual features, which contributes to improved evaluation results. We also apply a nonlinear feature fusion approach to combine the visual and audio features for better knowledge representation. Our proposed model shows superior performance in terms of both objective evaluation and human ratings compared to the baselines. In this extended work, we also provide a more extensive review of related work, conduct additional experiments with word-level and context-level pretrained embeddings, and investigate different qualitative aspects of the generated responses.
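The abstract names two mechanisms: query-guided attention applied within each modality and then across modality summaries ("hierarchical"), and a nonlinear fusion of the visual and audio features. The sketch below is a minimal, hypothetical PyTorch illustration of these two ideas only; the module names, dimensions, and the gated-tanh fusion choice are assumptions, not the authors' implementation.

    # Hypothetical sketch: two-level (hierarchical) multimodal attention plus
    # a nonlinear (gated-tanh) visual/audio fusion. All names and sizes are
    # illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HierarchicalMultimodalAttention(nn.Module):
        def __init__(self, d_model=256):
            super().__init__()
            # Level 1: per-modality attention scorers, conditioned on the query.
            self.score = nn.ModuleDict({
                m: nn.Linear(2 * d_model, 1) for m in ("caption", "visual", "audio")
            })
            # Level 2: attention over the attended modality summaries.
            self.modality_score = nn.Linear(2 * d_model, 1)
            # Nonlinear fusion of visual and audio summaries (gated tanh).
            self.fuse = nn.Linear(2 * d_model, d_model)
            self.gate = nn.Linear(2 * d_model, d_model)

        def attend(self, name, query, feats):
            # feats: (batch, seq, d); query: (batch, d) -> summary (batch, d)
            q = query.unsqueeze(1).expand(-1, feats.size(1), -1)
            w = F.softmax(self.score[name](torch.cat([q, feats], dim=-1)), dim=1)
            return (w * feats).sum(dim=1)

        def forward(self, query, caption, visual, audio):
            # Level 1: attend within each modality, guided by the user query.
            c = self.attend("caption", query, caption)
            v = self.attend("visual", query, visual)
            a = self.attend("audio", query, audio)
            # Nonlinear fusion of the visual and audio summaries.
            va = torch.cat([v, a], dim=-1)
            fused = torch.tanh(self.fuse(va)) * torch.sigmoid(self.gate(va))
            # Level 2: attend over modality summaries (caption vs. fused A/V).
            mods = torch.stack([c, fused], dim=1)              # (batch, 2, d)
            q = query.unsqueeze(1).expand(-1, 2, -1)
            w = F.softmax(self.modality_score(torch.cat([q, mods], dim=-1)), dim=1)
            return (w * mods).sum(dim=1)                       # context vector

    # Toy usage with random features: query (4, 256), caption (4, 12, 256),
    # visual and audio (4, 20, 256) -> context vector of shape (4, 256).
    model = HierarchicalMultimodalAttention(d_model=256)
    ctx = model(torch.randn(4, 256), torch.randn(4, 12, 256),
                torch.randn(4, 20, 256), torch.randn(4, 20, 256))
    print(ctx.shape)  # torch.Size([4, 256])

The second attention level lets the decoder weight caption evidence against the fused audio-visual evidence per query, which is one plausible reading of "hierarchical" here; the paper itself should be consulted for the exact architecture.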
Keywords
Audio-visual scene-aware dialogue, Dialogue system, Multimodal attention, Neural network, Response generation
Discipline
Databases and Information Systems
Research Areas
Data Science and Engineering
Publication
Computer Speech and Language
Volume
63
First Page
1
Last Page
13
ISSN
0885-2308
Identifier
10.1016/j.csl.2020.101095
Citation
LE, Hung; SAHOO, Doyen; CHEN, Nancy F.; and HOI, Steven C. H.
Hierarchical multimodal attention for end-to-end audio-visual scene-aware dialogue response generation. (2020). Computer Speech and Language, 63, 1-13.
Available at: https://ink.library.smu.edu.sg/sis_research/5259
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1016/j.csl.2020.101095