Research Collection School Of Computing and Information Systems

Frame-voyager: Learning to query frames for video large language models

Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

4-2025

Abstract

Video Large Language Models (Video-LLMs) have made remarkable progress in video understanding tasks. However, they are constrained by the maximum length of input tokens, making it impractical to input entire videos. Existing frame selection approaches, such as uniform frame sampling and text-frame retrieval, fail to account for the information density variations in the videos or the complex instructions in the tasks, leading to sub-optimal performance. In this paper, we propose Frame-Voyager that learns to query informative frame combinations, based on the given textual queries in the task. To train Frame-Voyager, we introduce a new data collection and labeling pipeline, by ranking frame combinations using a pre-trained Video-LLM. Given a video of M frames, we traverse its T-frame combinations, feed them into a Video-LLM, and rank them based on Video-LLM's prediction losses. Using this ranking as supervision, we train Frame-Voyager to query the frame combinations with lower losses. In experiments, we evaluate Frame-Voyager on four Video Question Answering benchmarks by plugging it into two different Video-LLMs. The experimental results demonstrate that Frame-Voyager achieves impressive results in all settings, highlighting its potential as a plug-and-play solution for Video-LLMs.
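The labeling step described in the abstract (traverse the T-frame combinations of an M-frame video, score each with a Video-LLM, and rank by prediction loss) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `rank_frame_combinations` and `toy_loss` are hypothetical names, and `toy_loss` is a made-up stand-in for the prediction loss a real Video-LLM would return for a given frame combination and query.

```python
from itertools import combinations
from typing import Callable, List, Tuple

Combo = Tuple[int, ...]


def rank_frame_combinations(
    num_frames: int,
    combo_size: int,
    loss_fn: Callable[[Combo], float],
) -> List[Tuple[Combo, float]]:
    """Score every combo_size-frame combination and sort by ascending loss."""
    scored = [
        (combo, loss_fn(combo))
        for combo in combinations(range(num_frames), combo_size)
    ]
    # Lower loss means the combination is treated as more informative.
    scored.sort(key=lambda item: item[1])
    return scored


def toy_loss(combo: Combo) -> float:
    # ASSUMPTION: hypothetical stand-in for a Video-LLM prediction loss.
    # It pretends frames 2 and 5 contain the answer and counts how many
    # of them are missing from the combination.
    relevant = {2, 5}
    return float(len(relevant - set(combo)))


# M = 6 frames, T = 2 frames per combination.
ranking = rank_frame_combinations(num_frames=6, combo_size=2, loss_fn=toy_loss)
best_combo, best_loss = ranking[0]
print(best_combo, best_loss)
```

The number of combinations is the binomial coefficient C(M, T), which grows quickly with M and T, so exhaustive traversal like this is only feasible for small values.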

Discipline

Artificial Intelligence and Robotics | Programming Languages and Compilers

Research Areas

Intelligent Systems and Optimization

Areas of Excellence

Digital transformation

Publication

Proceedings of the Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28

First Page

1

Last Page

26

City or Country

Singapore

Citation

YU, Sicheng; JIN, Chengkai; WANG, Huanyu; CHEN, Zhenghao; JIN, Sheng; ZUO, Zhongrong; XU, Xiaolei; SUN, Zhenbang; ZHANG, Bingni; WU, Jiawei; ZHANG, Hao; and SUN, Qianru. Frame-voyager: Learning to query frames for video large language models. (2025). Proceedings of the Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28. 1-26.
Available at: https://ink.library.smu.edu.sg/sis_research/10147

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Included in

Artificial Intelligence and Robotics Commons, Programming Languages and Compilers Commons
