Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

5-2019

Abstract

Multimodal dialogue systems are attracting increasing attention with a more natural and informative way for human-computer interaction. As one of its core components, the belief tracker estimates the user's goal at each step of the dialogue and provides a direct way to validate the ability of dialogue understanding. However, existing studies on belief trackers are largely limited to textual modality, which cannot be easily extended to capture the rich semantics in multimodal systems such as those with product images. For example, in fashion domain, the visual appearance of clothes play a crucial role in understanding the user's intention. In this case, the existing belief trackers may fail to generate accurate belief states for a multimodal dialogue system.In this paper, we present the first neural multimodal belief tracker (NMBT) to demonstrate how multimodal evidence can facilitate semantic understanding and dialogue state tracking. Given the multimodal inputs, while applying a textual encoder to represent textual utterances, the model gives special consideration to the semantics revealed in visual modality. It learns concept level fashion semantics by delving deep into image sub-regions and integrating concept probabilities via multiple instance learning. Then in each turn, an adaptive attention mechanism learns to automatically emphasize on different evidence sources of both visual and textual modalities for more accurate dialogue state prediction. We perform extensive evaluation on a multi-turn task-oriented dialogue dataset in fashion domain and the results show that our method achieves superior performance as compared to a wide range of baselines.

Discipline

Artificial Intelligence and Robotics | Databases and Information Systems

Research Areas

Data Science and Engineering; Intelligent Systems and Optimization

Publication

WWW '19: The World Wide Web Conference

First Page

2401

Last Page

2412

ISBN

9781450366748

Identifier

10.1145/3308558.3313598

Publisher

Association for Computing Machinery

City or Country

New York, NY, United States

Citation

ZHANG, Zheng; LIAO, Lizi; HUANG, Minlie; ZHU, Xiaoyan; and CHUA, Tat-Seng. Neural multimodal belief tracker with adaptive attention for dialogue systems. (2019). WWW '19: The World Wide Web Conference. 2401-2412.
Available at: https://ink.library.smu.edu.sg/sis_research/7676

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1145/3308558.3313598

Download

Included in

Artificial Intelligence and Robotics Commons, Databases and Information Systems Commons

COinS

Research Collection School Of Computing and Information Systems

Neural multimodal belief tracker with adaptive attention for dialogue systems

Publication Type

Version

Publication Date

Abstract

Discipline

Research Areas

Publication

First Page

Last Page

ISBN

Identifier

Publisher

City or Country

Citation

Creative Commons License

Additional URL

Included in

Search

Links

Browse

Links

Research Collection School Of Computing and Information Systems

Neural multimodal belief tracker with adaptive attention for dialogue systems

Author

Publication Type

Version

Publication Date

Abstract

Discipline

Research Areas

Publication

First Page

Last Page

ISBN

Identifier

Publisher

City or Country

Citation

Creative Commons License

Additional URL

Included in

Share

Search

Links

Browse

Links