Publication Type

Conference Paper

Version

submittedVersion

Publication Date

6-2023

Abstract

In this report, we present our champion solution for the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2023. Essentially, to accurately ground a query in a video, an effective egocentric feature extractor and a powerful grounding model are required. Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and further fine-tune the model on annotated data. In addition, we introduce a novel grounding model, GroundNLQ, which employs a multi-modal multi-scale grounding module for effective video-text fusion and for capturing various temporal intervals, especially in long videos. On the blind test set, GroundNLQ achieves 25.67 and 18.18 for R1@IoU=0.3 and R1@IoU=0.5, respectively, surpassing all other teams by a noticeable margin. Our code will be released at https://github.com/houzhijian/GroundNLQ.

Discipline

Databases and Information Systems

Research Areas

Data Science and Engineering

Publication

3rd International Ego4D Workshop, Vancouver, June 19, 2023

Publisher

ACM

City or Country

Vancouver

Copyright Owner and License

Authors
