Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
1-2025
Abstract
Cross-modal representation learning is essential for interactive text-to-video search tasks. However, representation learning is limited by the size and quality of available video-caption pairs. To improve search accuracy, we propose to enlarge the pool of video-caption pairs by leveraging a multi-modal LLM for video captioning. Specifically, we use the LLM to generate captions for a large video collection (i.e., the WebVid dataset) and use the generated video-caption pairs to pre-train a text-to-video search model. Additionally, we use the LLM to generate fine-grained captions for the test video collections to enable text-to-caption retrieval. Furthermore, our interactive video retrieval system builds a semantic overview of the retrieved rank list from these detailed captions, which serves as hints for users to refine their queries. Experimental results show that the generated captions are effective in improving the search accuracy of both the Ad-hoc Video Search (AVS) and Textual Known-Item Search (T-KIS) tasks on the TRECVid datasets.
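Below is a minimal sketch, not the paper's implementation, of the text-to-caption retrieval idea described in the abstract: LLM-generated captions for each test video are embedded with an off-the-shelf sentence encoder and ranked against a user query by cosine similarity. The encoder name, the example captions, and the `search` helper are illustrative assumptions.

```python
# Illustrative sketch of text-to-caption retrieval (not the authors' model).
# Assumes the sentence-transformers package; model name and captions are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical LLM-generated captions, one per video in the test collection.
video_captions = {
    "video_001": "A man in a red jacket skis down a snowy slope past pine trees.",
    "video_002": "A chef slices vegetables on a wooden cutting board in a bright kitchen.",
    "video_003": "Children play football on a grass field while spectators watch.",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder text encoder

# Embed all captions once; normalization makes dot product equal cosine similarity.
video_ids = list(video_captions)
caption_emb = encoder.encode(
    [video_captions[v] for v in video_ids], normalize_embeddings=True
)

def search(query: str, top_k: int = 3):
    """Rank videos by cosine similarity between the query and their captions."""
    query_emb = encoder.encode([query], normalize_embeddings=True)[0]
    scores = caption_emb @ query_emb
    order = np.argsort(-scores)[:top_k]
    return [(video_ids[i], float(scores[i])) for i in order]

if __name__ == "__main__":
    print(search("someone cooking in a kitchen"))
```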
Keywords
Interactive Video Retrieval, Multi-modal LLM, Video Captioning
Discipline
Artificial Intelligence and Robotics | Databases and Information Systems
Research Areas
Data Science and Engineering
Publication
MultiMedia Modeling: 31st International Conference on Multimedia Modeling, MMM 2025, Nara, Japan, January 8-10, 2025, Proceedings
Volume
15524
First Page
302
Last Page
309
ISBN
9789819620739
Identifier
10.1007/978-981-96-2074-6_36
Publisher
Springer
City or Country
Cham
Citation
CHENG, Yu-Tong; WU, Jiaxin; MA, Zhixin; HE, Jiangshan; WEI, Xiao-Yong; and NGO, Chong-wah.
Interactive video search with multi-modal LLM video captioning. (2025). MultiMedia Modeling: 31st International Conference on Multimedia Modeling, MMM 2025, Nara, Japan, January 8-10, 2025, Proceedings. 15524, 302-309.
Available at: https://ink.library.smu.edu.sg/sis_research/10105
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1007/978-981-96-2074-6_36