Publication Type

Journal Article

Version

acceptedVersion

Publication Date

1-2022

Abstract

Video question answering (VideoQA) has emerged as a popular research topic in recent years. Enormous efforts have been devoted to developing more effective fusion strategies and better intra-modal feature preparation. To explore these issues further, we identify two key problems. (1) Current works rarely incorporate the action of interest into video representations, and many datasets lack sufficient labels indicating where the action of interest occurs. However, questions in VideoQA are usually action-centric. (2) Frame-to-frame relations, which can provide useful temporal attributes (e.g., state transitions, action counting), remain under-explored. Based on these observations, we propose an action-centric relation transformer network (ACRTransformer) for VideoQA and make two significant improvements. (1) We explicitly consider the action recognition problem and present a visual feature encoding technique, action-based encoding (ABE), to emphasize the frames with high actionness probabilities (the probability that the frame contains actions). (2) We better exploit the interplay between temporal frames using a relation transformer network (RTransformer). Experiments on popular VideoQA benchmark datasets clearly establish our superiority over previous state-of-the-art models. Code can be found at https://github.com/op-multimodal/ACRTransformer.
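
The following is a minimal sketch, not the authors' implementation (see the GitHub repository above for the actual ACRTransformer code). It only illustrates the two ideas stated in the abstract under assumed shapes and module names: weighting frame features by a per-frame actionness score (the ABE idea) and encoding frame-to-frame relations with a transformer encoder (the RTransformer idea).

```python
# Hedged sketch of the two abstract-level ideas; all dimensions and the
# actionness head are assumptions for illustration, not the paper's design.
import torch
import torch.nn as nn


class ActionCentricSketch(nn.Module):
    def __init__(self, feat_dim=2048, model_dim=512, num_heads=8, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, model_dim)    # project frame features
        self.actionness = nn.Linear(feat_dim, 1)      # assumed per-frame actionness head
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=num_heads, batch_first=True
        )
        self.relation = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim) appearance/motion features
        w = torch.sigmoid(self.actionness(frame_feats))  # (batch, num_frames, 1) in [0, 1]
        x = self.proj(frame_feats) * w                   # emphasize high-actionness frames
        return self.relation(x)                          # frame-to-frame relation encoding


if __name__ == "__main__":
    feats = torch.randn(2, 16, 2048)   # 2 clips, 16 sampled frames each
    out = ActionCentricSketch()(feats)
    print(out.shape)                   # torch.Size([2, 16, 512])
```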

Keywords

Knowledge discovery, multi-modal reasoning, proposals, relation reasoning, task analysis, temporal action detection, video question answering, video representation, visualization, cognition, encoding, feature extraction

Discipline

Broadcast and Video Studies | Databases and Information Systems | Numerical Analysis and Scientific Computing

Publication

IEEE Transactions on Circuits and Systems for Video Technology

Volume

32

Issue

1

First Page

63

Last Page

74

ISSN

1051-8215

Identifier

10.1109/TCSVT.2020.3048440

Publisher

IEEE

Embargo Period

7-7-2021

Copyright Owner and License

Authors

Additional URL

https://doi.org/10.1109/TCSVT.2020.3048440
