Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
5-2025
Abstract
Empathetic Response Generation (ERG) is one of the key tasks in affective computing, aiming to produce emotionally nuanced and compassionate responses to users' queries. However, existing ERG research is predominantly confined to the single text modality, limiting its effectiveness since human emotions are inherently conveyed through multiple modalities. To address this, we introduce an avatar-based Multimodal ERG (MERG) task, entailing rich text, speech, and facial vision information. We first present a large-scale, high-quality benchmark dataset, AvaMERG, which extends traditional text ERG by incorporating authentic human speech audio and dynamic talking-face avatar videos, encompassing a diverse range of avatar profiles and broadly covering topics from real-world scenarios. We further tailor a system, named Empatheia, for MERG. Built upon a Multimodal Large Language Model (MLLM) with a multimodal encoder and speech and avatar generators, Empatheia performs end-to-end MERG, with a Chain-of-Empathetic reasoning mechanism integrated for enhanced empathy understanding and reasoning. Finally, we devise a set of empathy-enhanced tuning strategies, strengthening emotional accuracy, content quality, and avatar-profile consistency across modalities. Experimental results on AvaMERG demonstrate that Empatheia consistently outperforms baseline methods on both textual ERG and MERG. All data and code are open at https://AvaMERG.github.io/.
Keywords
Empathetic Response Generation, Multimodal Large Language Model, Avatar Generation, Affective Computing
Discipline
Artificial Intelligence and Robotics
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
WWW '25: Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 2025 April 28 - May 2
First Page
2872
Last Page
2881
Identifier
10.1145/3696410.3714739
Publisher
ACM
City or Country
New York
Citation
ZHANG, Han; MENG, Zixiang; LUO, Meng; HAN, Hong; LIAO, Lizi; CAMBRIA, Erik; and FEI, Hao.
Towards multimodal empathetic response generation: A rich text-speech-vision avatar-based benchmark. (2025). WWW '25: Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 2025 April 28 - May 2. 2872-2881.
Available at: https://ink.library.smu.edu.sg/sis_research/10763
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1145/3696410.3714739