Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
11-2023
Abstract
Given a descriptive language query, Video Moment Retrieval (VMR) aims to locate the semantically consistent moment clip in a video, represented as a pair of start and end timestamps. Although current methods achieve satisfactory performance, training these models relies heavily on fully annotated VMR datasets. However, precise video temporal annotations are extremely labor-intensive and ambiguous due to the diverse preferences of different annotators. Although several works explore weakly supervised VMR with scattered annotated frames as labels, there is still much room for improvement in accuracy. Therefore, we design a new VMR setting in which users can easily point to small segments of non-controversial video moments, and our proposed method automatically fills in the remaining parts based on the video and query semantics. To support this, we propose a new framework named Video Moment Retrieval via Iterative Learning (VMRIL). It treats the partially annotated temporal region as a seed, then expands the pseudo label through iterative training. To restrict the expansion to reasonable boundaries, we utilize a pretrained video action localization model to provide coarse guidance on potential video segments. Compared with other VMR methods, our VMRIL achieves a good trade-off between performance and annotation efficiency. Experimental results show that our proposed method achieves state-of-the-art performance in the weakly supervised VMR setting and is even comparable with some fully supervised VMR methods, but with much lower annotation cost.
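To make the iterative expansion idea in the abstract concrete, below is a minimal, self-contained Python sketch, not the authors' released code. Frame-level relevance scores stand in for a trained VMR model's query-video matching, and `coarse_bound` stands in for the pretrained action localizer's output; all names and the threshold rule are illustrative assumptions.

```python
# Sketch of iterative pseudo-label expansion: grow a (start, end) label
# outward from a user-provided seed, clipped to a coarse action boundary.
import numpy as np

def expand_pseudo_label(scores, seed, coarse_bound, rounds=5, thresh=0.5):
    """Grow a (start, end) frame-index pseudo label from a seed segment.

    scores:       per-frame query-video relevance in [0, 1] (hypothetical
                  stand-in for the VMR model's predictions).
    seed:         (start, end) indices of the user's partial annotation.
    coarse_bound: (lo, hi) indices from a pretrained action localizer;
                  the expansion never leaves this coarse segment.
    """
    lo, hi = coarse_bound
    start, end = seed
    for _ in range(rounds):
        # Expand left while the neighboring frame still looks relevant
        # and we stay inside the coarse action boundary.
        while start > lo and scores[start - 1] >= thresh:
            start -= 1
        # Expand right under the same conditions.
        while end < hi - 1 and scores[end + 1] >= thresh:
            end += 1
        # In the full framework the model would be retrained on the new
        # pseudo label here, refreshing `scores` before the next round.
    return start, end

# Toy usage: 20 frames, relevant region roughly frames 5..13.
scores = np.array([.1] * 5 + [.8] * 9 + [.1] * 6)
print(expand_pseudo_label(scores, seed=(8, 9), coarse_bound=(4, 15)))
# -> (5, 13)
```

In the actual framework, retraining on each expanded pseudo label updates the relevance scores between rounds, which is what makes the iteration progressively refine the boundaries rather than converge after one pass.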
Keywords
Coarse guidance, Iterative learning, Labor-intensive, Pseudo label, Query video, Retrieval methods, Timestamp, Video moment retrieval
Discipline
Databases and Information Systems
Research Areas
Data Science and Engineering
Publication
Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, Canada, 2023 October 29-November 3
First Page
4330
Last Page
4339
ISBN
9798400701085
Identifier
10.1145/3581783.3612088
Publisher
ACM
City or Country
New York
Citation
JI, Wei; LIANG, Renjie; LIAO, Lizi; FEI, Hao; and FENG, Fuli.
Partial annotation-based video moment retrieval via iterative learning. (2023). Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, Canada, 2023 October 29-November 3. 4330-4339.
Available at: https://ink.library.smu.edu.sg/sis_research/8585
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1145/3581783.3612088