Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
7-2023
Abstract
This paper tackles an emerging and challenging problem, long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Long videos are in as much demand as short videos but are less explored, and they bring new challenges: higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework built on top of existing VTG models that handles long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency: the query-guided window selection mechanism speeds up inference by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results. The code has been released at https://github.com/houzhijian/CONE.
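The sliding window and query-guided window selection ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the released code for that). The function names, the mean-pooling of frame features, and the cosine-similarity scoring are all assumptions made for illustration; the abstract does not specify how windows are scored.

```python
import math

def sliding_windows(num_frames, window_size, stride):
    """Return (start, end) frame ranges covering the video; end is exclusive."""
    if num_frames == 0:
        return []
    if num_frames <= window_size:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window_size + 1, stride))
    # Add a final window if the stride left the tail of the video uncovered.
    if starts[-1] + window_size < num_frames:
        starts.append(num_frames - window_size)
    return [(s, s + window_size) for s in starts]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_pool(frames):
    """Average a list of per-frame feature vectors into one window vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def select_windows(frame_feats, query_feat, window_size, stride, top_k):
    """Hypothetical coarse step: keep the top_k windows most similar to the query.

    Assumption: a window is scored by cosine similarity between its
    mean-pooled frame features and the query feature.
    """
    windows = sliding_windows(len(frame_feats), window_size, stride)
    scored = [
        (cosine(mean_pool(frame_feats[s:e]), query_feat), (s, e))
        for s, e in windows
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [window for _, window in scored[:top_k]]

# Toy example: frames 4..7 match the query direction, frames 0..3 do not.
frame_feats = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
query_feat = [0.0, 1.0]
selected = select_windows(frame_feats, query_feat, window_size=4, stride=2, top_k=2)
```

Under these assumptions, only the selected windows would be passed to the more expensive fine-grained grounding model, which is where an inference speedup of the kind reported in the abstract would come from.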
Keywords
Benchmarking, Computational linguistics, Natural language processing systems
Discipline
Artificial Intelligence and Robotics
Research Areas
Intelligent Systems and Optimization
Publication
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, July 9-14
First Page
8013
Last Page
8028
Identifier
10.18653/v1/2023.acl-long.445
Publisher
Association for Computational Linguistics
City or Country
Texas, USA
Citation
HOU, Zhijian; ZHONG, Wanjun; JI, Lei; GAO, Difei; YAN, Kun; CHAN, Wing-Kwong; NGO, Chong-Wah; SHOU, Mike Z.; and DUAN, Nan.
CONE: An efficient COarse-to-fiNE alignment framework for long video temporal grounding. (2023). Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, July 9-14. 8013-8028.
Available at: https://ink.library.smu.edu.sg/sis_research/8375
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
http://dx.doi.org/10.18653/v1/2023.acl-long.445