Research Collection School Of Computing and Information Systems

Contrastive video question answering via video graph transformer

Publication Type

Journal Article

Version

acceptedVersion

Publication Date

7-2023

Abstract

We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT’s uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations and dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of multi-modal transformer for answer classification. Fine-grained video-text communication is done by additional cross-modal interaction modules. 3) It is optimized by the joint fully- and self-supervised contrastive objectives between the correct and incorrect answers, as well as the relevant and irrelevant questions respectively. With superior video encoding and QA solution, we show that CoVGT can achieve much better performances than previous arts on video reasoning tasks. Its performances even surpass those models that are pretrained with millions of external data. We further show that CoVGT can also benefit from cross-modal pretraining, yet with orders of magnitude smaller data. The results demonstrate the effectiveness and superiority of CoVGT, and additionally reveal its potential for more data-efficient pretraining. We hope our success can advance VideoQA beyond coarse recognition/description towards fine-grained relation reasoning of video contents. Our code is available at https://github.com/doc-doc/CoVGT.

Keywords

VideoQA, Cross-Modal Visual Reasoning, Video-Language, Dynamic Visual Graphs, Contrastive Learning, Transformer

Discipline

Graphics and Human Computer Interfaces

Research Areas

Intelligent Systems and Optimization

Areas of Excellence

Digital transformation

Publication

IEEE Transactions on Pattern Analysis and Machine Intelligence

Volume

Issue

First Page

13265

Last Page

13280

ISSN

0162-8828

Identifier

10.1109/TPAMI.2023.3292266

Publisher

Institute of Electrical and Electronics Engineers

Citation

XIAO, Junbin Xiao; ZHOU, Pan; YAO, Angela; LI, Yicong; HONG, Richang; YAN, Shuicheng; and CHUA, Tat-Seng. Contrastive video question answering via video graph transformer. (2023). IEEE Transactions on Pattern Analysis and Machine Intelligence. 45, (11), 13265-13280.
Available at: https://ink.library.smu.edu.sg/sis_research/9053

Copyright Owner and License

Authors

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1109/TPAMI.2023.3292266

Download

Included in

Graphics and Human Computer Interfaces Commons

COinS

Research Collection School Of Computing and Information Systems

Contrastive video question answering via video graph transformer

Publication Type

Version

Publication Date

Abstract

Keywords

Discipline

Research Areas

Areas of Excellence

Publication

Volume

Issue

First Page

Last Page

ISSN

Identifier

Publisher

Citation

Copyright Owner and License

Creative Commons License

Additional URL

Included in

Search

Links

Browse

Links

Research Collection School Of Computing and Information Systems

Contrastive video question answering via video graph transformer

Author

Publication Type

Version

Publication Date

Abstract

Keywords

Discipline

Research Areas

Areas of Excellence

Publication

Volume

Issue

First Page

Last Page

ISSN

Identifier

Publisher

Citation

Copyright Owner and License

Creative Commons License

Additional URL

Included in

Share

Search

Links

Browse

Links