Publication Type

Journal Article

Version

submittedVersion

Publication Date

12-2024

Abstract

Video Moment Retrieval (VMR) aims to identify specific event moments within untrimmed videos based on natural language queries. Existing VMR methods have been criticized for relying heavily on moment annotation bias rather than true multi-modal alignment reasoning. Weakly supervised VMR approaches inherently overcome this issue by training without precise temporal location information. However, they struggle with fine-grained semantic alignment and often yield multiple speculative predictions with prolonged video spans. In this paper, we take a step forward in the context of weakly supervised VMR by proposing a triadic temporal-semantic alignment model. Our proposed approach augments weak supervision by comprehensively addressing the multi-modal semantic alignment between query sentences and videos from both fine-grained and coarse-grained perspectives. To capture fine-grained cross-modal semantic correlations, we introduce a concept-aspect alignment strategy that leverages nouns to select relevant video clips. Additionally, an action-aspect alignment strategy with verbs is employed to capture temporal information. Furthermore, we propose an event-aspect alignment strategy that focuses on event information within coarse-grained video clips, thus mitigating the tendency towards long video span predictions during coarse-grained cross-modal semantic alignment. Extensive experiments conducted on the Charades-CD and ActivityNet-CD datasets demonstrate the superior performance of our proposed method.
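
As a rough illustration of the concept-aspect idea mentioned in the abstract (using nouns from the query to pick relevant video clips), the sketch below scores clips against noun features and keeps the best-matching ones. It is not the authors' implementation: it assumes clips and nouns have already been embedded in a shared vector space, and the function names `cosine` and `select_clips`, the max-over-nouns scoring, and the top-k selection are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_clips(clip_feats, noun_feats, top_k):
    """Hypothetical clip selection, assumed for illustration only.

    Each clip is scored by its highest cosine similarity to any noun
    feature. The indices of the top_k highest-scoring clips are returned
    in temporal (ascending index) order, together with all scores.
    """
    scores = [max(cosine(c, n) for n in noun_feats) for c in clip_feats]
    ranked = sorted(range(len(clip_feats)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k]), scores


# Toy features: three clips and one noun, all in a 2-D space.
clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nouns = [[1.0, 0.0]]
selected, scores = select_clips(clips, nouns, top_k=2)
```

In this toy setup the first and third clips are kept, because they point in (or partly in) the same direction as the noun vector. An analogous scoring with verb features could stand in for the action-aspect strategy, though the abstract does not specify how either score is computed.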

Keywords

Weakly supervised learning, Video moment retrieval, Temporal-semantic alignment

Discipline

Graphics and Human Computer Interfaces | Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

Pattern Recognition

Volume

156

First Page

Last Page

ISSN

0031-3203

Identifier

10.1016/j.patcog.2024.110819

Publisher

Elsevier

Citation

LIU, Jin; XIE, JiaLong; ZHOU, Fengyu; and HE, Shengfeng. Triadic temporal-semantic alignment for weakly-supervised video moment retrieval. (2024). Pattern Recognition. 156, 1-11.
Available at: https://ink.library.smu.edu.sg/sis_research/9286

Copyright Owner and License

Authors

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1016/j.patcog.2024.110819

Included in

Graphics and Human Computer Interfaces Commons, Software Engineering Commons

Research Collection School Of Computing and Information Systems

Triadic temporal-semantic alignment for weakly-supervised video moment retrieval
