Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
1-2025
Abstract
Cross-modal representation learning is essential for interactive text-to-video search tasks. However, representation learning is limited by the size and quality of available video-caption pairs. To improve search accuracy, we propose to enlarge the pool of video-caption pairs by leveraging a multi-modal LLM for video captioning. Specifically, we use the LLM to generate captions for a large video collection (i.e., the WebVid dataset) and use the generated video-caption pairs to pre-train a text-to-video search model. Additionally, we use the LLM to generate fine-grained captions for the test video collections to enable text-to-caption retrieval. Furthermore, our interactive video retrieval system builds a semantic overview of the retrieved rank list from these detailed captions, which serves as hints for users to refine their queries. Experimental results show that the generated captions are effective in improving the search accuracy of both the Ad-hoc Video Search (AVS) and Textual Known-Item Search (T-KIS) tasks on the TRECVid datasets.
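Below is a minimal sketch, not the paper's implementation, of the text-to-caption retrieval idea described in the abstract: LLM-generated captions for each test video are embedded with an off-the-shelf sentence encoder and ranked against a user query by cosine similarity. The encoder name, the example captions, and the `search` helper are illustrative assumptions.

```python
# Illustrative sketch of text-to-caption retrieval (not the authors' model).
# Assumes the sentence-transformers package; model name and captions are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical LLM-generated captions, one per video in the test collection.
video_captions = {
    "video_001": "A man in a red jacket skis down a snowy slope past pine trees.",
    "video_002": "A chef slices vegetables on a wooden cutting board in a bright kitchen.",
    "video_003": "Children play football on a grass field while spectators watch.",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder text encoder

# Embed all captions once; normalization makes dot product equal cosine similarity.
video_ids = list(video_captions)
caption_emb = encoder.encode(
    [video_captions[v] for v in video_ids], normalize_embeddings=True
)

def search(query: str, top_k: int = 3):
    """Rank videos by cosine similarity between the query and their captions."""
    query_emb = encoder.encode([query], normalize_embeddings=True)[0]
    scores = caption_emb @ query_emb
    order = np.argsort(-scores)[:top_k]
    return [(video_ids[i], float(scores[i])) for i in order]

if __name__ == "__main__":
    print(search("someone cooking in a kitchen"))
```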
Keywords
Interactive Video Retrieval, Multi-modal LLM, Video Captioning
Discipline
Artificial Intelligence and Robotics | Databases and Information Systems
Research Areas
Data Science and Engineering
Publication
MultiMedia Modeling: 31st International Conference on Multimedia Modeling, MMM 2025, Nara, Japan, January 8-10, 2025, Proceedings
Volume
15524
First Page
302
Last Page
309
ISBN
9789819620739
Identifier
10.1007/978-981-96-2074-6_36
Publisher
Springer
City or Country
Cham
Citation
CHENG, Yu-Tong; WU, Jiaxin; MA, Zhixin; HE, Jiangshan; WEI, Xiao-Yong; and NGO, Chong-wah.
Interactive video search with multi-modal LLM video captioning. (2025). MultiMedia Modeling: 31st International Conference on Multimedia Modeling, MMM 2025, Nara, Japan, January 8-10, 2025, Proceedings. 15524, 302-309.
Available at: https://ink.library.smu.edu.sg/sis_research/10105
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1007/978-981-96-2074-6_36