Publication Type
Journal Article
Version
submittedVersion
Publication Date
6-2024
Abstract
Cross-modal video retrieval aims to retrieve semantically relevant videos given a textual query, and is one of the fundamental multimedia tasks. Most top-performing methods primarily leverage Vision Transformer (ViT) to extract video features [1]-[3]. However, they suffer from the high computational complexity of ViT, especially when encoding long videos. A common and simple solution is to uniformly sample a small number (e.g., 4 or 8) of frames from the target video (instead of using the whole video) as ViT inputs. The number of frames has a strong influence on the performance of ViT, e.g., using 8 frames yields better performance than using 4 frames but requires more computational resources, resulting in a trade-off. To break free from this trade-off, this paper introduces an automatic video compression method based on a bilevel optimization program (BOP) consisting of both model-level (i.e., base-level) and frame-level (i.e., meta-level) optimizations. The model-level optimization process learns a cross-modal video retrieval model whose input includes the “compressed frames” learned by frame-level optimization. In turn, frame-level optimization is achieved through gradient descent using the meta loss of the video retrieval model computed on the whole video. We call this BOP method (as well as the “compressed frames”) the Meta-Optimized Frames (MOF) approach. By incorporating MOF, the video retrieval model is able to utilize the information of whole videos (for training) while taking only a small number of input frames in its actual implementation. The convergence of MOF is guaranteed by meta gradient descent algorithms. For evaluation purposes, we conduct extensive cross-modal video retrieval experiments on three large-scale benchmarks: MSR-VTT, MSVD, and DiDeMo. Our results show that MOF is a generic and efficient method that boosts multiple baseline methods, and can achieve new state-of-the-art performance. Our code is publicly available at: https://github.com/lionel-hing/MOF.
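To make the bilevel structure described above concrete, the following is a minimal, self-contained sketch of alternating base-level (model) and meta-level (frame) updates, assuming pre-extracted toy features; all module names, shapes, and loss choices here (e.g., ToyRetrievalModel, the contrastive and MSE terms) are illustrative assumptions and not the authors' released implementation, which is available at the repository linked in the abstract.

```python
# Hypothetical sketch of a base/meta bilevel loop: the base level trains a
# retrieval model on a small set of learnable "compressed frames", while the
# meta level updates those frames by gradient descent on a meta loss that
# references the whole video. Shapes and losses are toy assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 64                      # toy feature dimension
T_WHOLE, T_COMP = 32, 4     # whole-video frames vs. compressed frames

class ToyRetrievalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.video_proj = nn.Linear(D, D)   # stand-in for a ViT video encoder
        self.text_proj = nn.Linear(D, D)    # stand-in for a text encoder

    def forward(self, frames, text):
        v = self.video_proj(frames).mean(dim=1)   # pool frame features
        t = self.text_proj(text)
        return v, t

def contrastive_loss(v, t, temperature=0.07):
    # Standard in-batch video-text contrastive loss (diagonal pairs match).
    logits = F.normalize(v, dim=-1) @ F.normalize(t, dim=-1).T
    labels = torch.arange(v.size(0))
    return F.cross_entropy(logits / temperature, labels)

model = ToyRetrievalModel()
model_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# "Compressed frames" as meta-level learnable parameters (batch of 8 videos).
compressed = nn.Parameter(torch.randn(8, T_COMP, D))
frame_opt = torch.optim.Adam([compressed], lr=1e-2)

whole_video = torch.randn(8, T_WHOLE, D)   # toy whole-video features
text = torch.randn(8, D)                   # toy paired text features

for step in range(10):
    # Base level: update the retrieval model using the compressed frames.
    v, t = model(compressed.detach(), text)
    base_loss = contrastive_loss(v, t)
    model_opt.zero_grad(); base_loss.backward(); model_opt.step()

    # Meta level: update the compressed frames with a meta loss computed
    # against the whole video (here, matching its pooled representation).
    v_comp, t = model(compressed, text)
    with torch.no_grad():
        v_whole, _ = model(whole_video, text)
    meta_loss = contrastive_loss(v_comp, t) + F.mse_loss(v_comp, v_whole)
    frame_opt.zero_grad(); meta_loss.backward(); frame_opt.step()
```

In this sketch, detaching the compressed frames during the base-level step keeps the two optimization levels separate, so only the meta loss drives the frame updates, mirroring the alternating structure the abstract describes.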
Keywords
Cross-Modal, Multimodal, Video Compression, Video Retrieval
Discipline
Databases and Information Systems | Graphics and Human Computer Interfaces
Research Areas
Data Science and Engineering
Areas of Excellence
Digital transformation
Publication
IEEE Transactions on Multimedia
First Page
1
Last Page
12
ISSN
1520-9210
Identifier
10.1109/TMM.2024.3416669
Publisher
Institute of Electrical and Electronics Engineers
Citation
HAN, Ning; YANG, Xun; LIM, Ee-peng; CHEN, Hao; and SUN, Qianru.
Efficient cross-modal video retrieval with meta-optimized frames. (2024). IEEE Transactions on Multimedia. 1-12.
Available at: https://ink.library.smu.edu.sg/sis_research/9034
Copyright Owner and License
Authors-CC-BY
Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.
Additional URL
https://doi.org/10.1109/TMM.2024.3416669
Included in
Databases and Information Systems Commons, Graphics and Human Computer Interfaces Commons