Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
11-2024
Abstract
In the realm of dialogue-to-image retrieval, the primary challenge is to fetch images from a pre-compiled database that accurately reflect the intent embedded within the dialogue history. Existing methods often overemphasize inter-modal alignment, neglecting the nuanced nature of conversational context. Dialogue histories are frequently cluttered with redundant information and often lack direct image descriptions, leading to a substantial disconnect between conversational content and visual representation. This study introduces VCU, a novel framework designed to enhance the comprehension of dialogue history and improve cross-modal matching for image retrieval. VCU leverages large language models (LLMs) to perform a two-step extraction process. It generates precise image-related descriptions from dialogues, while also enhancing visual representation by utilizing object-list texts associated with images. Additionally, auxiliary query collections are constructed to balance the matching process, thereby reducing bias in similarity computations. Experimental results demonstrate that VCU significantly outperforms baseline methods in dialogue-to-image retrieval tasks, highlighting its potential for practical application and effectiveness in bridging the gap between dialogue context and visual content.
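The abstract's last methodological point, balancing the matching process with auxiliary query collections, can be illustrated with a toy sketch. The abstract does not say how VCU computes this balance, so everything below is an assumption: the function names (`cosine`, `balanced_scores`, `rank`), the use of cosine similarity, and the choice to subtract each image's mean similarity to the auxiliary queries are hypothetical stand-ins for one plausible reading of "reducing bias in similarity computations", not the authors' implementation.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def balanced_scores(query, images, aux_queries):
    """Score each image against the query, minus its mean similarity to auxiliary queries.

    Assumption (not taken from the paper): an image that is similar to many
    unrelated queries is treated as biased, and that average similarity is
    subtracted from its raw score. With no auxiliary queries the raw score is kept.
    """
    scores = []
    for img in images:
        raw = cosine(query, img)
        if aux_queries:
            bias = sum(cosine(aux, img) for aux in aux_queries) / len(aux_queries)
        else:
            bias = 0.0
        scores.append(raw - bias)
    return scores


def rank(scores):
    """Return image indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical 2-D embeddings, chosen only to make the effect visible.
dialogue_query = [0.6, 0.8]
image_embeddings = [[1.0, 1.0], [0.2, 1.0]]   # image 0 resembles almost any query
auxiliary_queries = [[1.0, 0.0], [1.0, 0.2]]

raw_ranking = rank(balanced_scores(dialogue_query, image_embeddings, []))
balanced_ranking = rank(
    balanced_scores(dialogue_query, image_embeddings, auxiliary_queries)
)
```

In this toy data, image 0 leads on raw cosine similarity, while image 1 leads once each image's average similarity to the auxiliary queries is subtracted. That is the kind of correction the abstract appears to describe, though the published method may differ in every detail.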
Keywords
Dialogue-to-image retrieval, Dialogue history comprehension, Visual context understanding
Discipline
Artificial Intelligence and Robotics | Computer Sciences
Research Areas
Data Science and Engineering; Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
Proceedings of the 19th Conference on Empirical Methods in Natural Language Processing (EMNLP 2024) : Miami, Florida, USA, November 12-16
First Page
7929
Last Page
7942
Publisher
Association for Computational Linguistics
City or Country
Miami, Florida, USA
Citation
WEI, Zhaohui; LIAO, Lizi; DU, Xiaoyu; and XIANG, Xinguang.
Balancing visual context understanding in dialogue for image retrieval. (2024). Proceedings of the 19th Conference on Empirical Methods in Natural Language Processing (EMNLP 2024) : Miami, Florida, USA, November 12-16. 7929-7942.
Available at: https://ink.library.smu.edu.sg/sis_research/9693
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.