Publication Type
Conference Paper
Version
submittedVersion
Publication Date
6-2023
Abstract
In this report, we present our champion solution for the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2023. Essentially, accurately grounding a natural language query in a video requires both an effective egocentric feature extractor and a powerful grounding model. Motivated by this, we adopt a two-stage pre-training strategy to train the egocentric feature extractors and the grounding model on video narrations, and then fine-tune the model on annotated data. In addition, we introduce a novel grounding model, GroundNLQ, which employs a multi-modal multi-scale grounding module for effective fusion of video and text and for grounding over varied temporal intervals, which is especially important for long videos. On the blind test set, GroundNLQ achieves 25.67 and 18.18 for R1@IoU=0.3 and R1@IoU=0.5, respectively, surpassing all other teams by a noticeable margin. Our code will be released at https://github.com/houzhijian/GroundNLQ.
Discipline
Databases and Information Systems
Research Areas
Data Science and Engineering
Publication
3rd International Ego4D workshop, Vancouver, 2023 June 19
Publisher
ACM
City or Country
Vancouver
Citation
HOU, Zhijian; JI, Lei; GAO, Difei; ZHONG, Wanjun; YAN, Kun; CHAN, Wing-Kwong; NGO, Chong-Wah; DUAN, Nan; and SHOU, Mike Zheng.
GroundNLQ @ Ego4D Natural Language Queries Challenge 2023. (2023). 3rd International Ego4D workshop, Vancouver, 2023 June 19.
Available at: https://ink.library.smu.edu.sg/sis_research/8416
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.