Publication Type
Journal Article
Version
acceptedVersion
Publication Date
1-2022
Abstract
Video question answering (VideoQA) has emerged as a popular research topic in recent years. Enormous effort has been devoted to developing more effective fusion strategies and better intra-modal feature preparation. To explore these issues further, we identify two key problems. (1) Current works rarely incorporate the action of interest into the video representation, and many datasets provide insufficient labels for where the action of interest occurs, even though questions in VideoQA are usually action-centric. (2) Frame-to-frame relations, which can provide useful temporal attributes (e.g., state transitions, action counting), remain under-explored. Based on these observations, we propose an action-centric relation transformer network (ACRTransformer) for VideoQA that makes two significant improvements. (1) We explicitly consider the action recognition problem and present a visual feature encoding technique, action-based encoding (ABE), which emphasizes frames with high actionness probabilities (the probability that a frame contains actions). (2) We better exploit the interplay between temporal frames using a relation transformer network (RTransformer). Experiments on popular VideoQA benchmark datasets clearly establish our superiority over previous state-of-the-art models. Code can be found at https://github.com/op-multimodal/ACRTransformer.
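The two ideas named in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see the GitHub repository above for that): the class names ActionBasedEncoding and RelationBlock, the softmax weighting of actionness scores, and the use of nn.MultiheadAttention for frame-to-frame relations are all illustrative assumptions about how such components might be wired together.

```python
# Hedged sketch of the abstract's two components; NOT the ACRTransformer code.
# Assumes per-frame visual features and an external per-frame actionness score
# (e.g., from a temporal action detector).
import torch
import torch.nn as nn

class ActionBasedEncoding(nn.Module):
    """Sketch of action-based encoding: re-weight frame features by actionness."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, frames, actionness):
        # frames: (B, T, D); actionness: (B, T), higher = more likely to contain action
        weights = torch.softmax(actionness, dim=-1).unsqueeze(-1)  # (B, T, 1)
        # Emphasize frames with high actionness probabilities.
        return self.proj(frames) * weights

class RelationBlock(nn.Module):
    """Sketch of relation modeling: self-attention over temporal frames."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        rel, _ = self.attn(x, x, x)  # pairwise frame-to-frame interactions
        return self.norm(x + rel)    # residual connection, as in standard transformers

# Toy usage with random tensors (shapes are assumptions).
B, T, D = 2, 16, 256
frames = torch.randn(B, T, D)
actionness = torch.rand(B, T)                 # stand-in for detector output
encoded = ActionBasedEncoding(D)(frames, actionness)
relations = RelationBlock(D)(encoded)         # (B, T, D)
```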
Keywords
knowledge discovery, multi-modal reasoning, proposals, relation reasoning, task analysis, temporal action detection, video question answering, video representation, visualization, cognition, encoding, feature extraction
Discipline
Broadcast and Video Studies | Databases and Information Systems | Numerical Analysis and Scientific Computing
Publication
IEEE Transactions on Circuits and Systems for Video Technology
Volume
32
Issue
1
First Page
63
Last Page
74
ISSN
1051-8215
Identifier
10.1109/TCSVT.2020.3048440
Publisher
IEEE
Embargo Period
7-7-2021
Citation
ZHANG, Jipeng; SHAO, Jie; CAO, Rui; GAO, Lianli; XU, Xing; and SHEN, Heng Tao.
Action-centric relation transformer network for video question answering. (2022). IEEE Transactions on Circuits and Systems for Video Technology. 32, (1), 63-74.
Available at: https://ink.library.smu.edu.sg/sis_research/6020
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1109/TCSVT.2020.3048440
Included in
Broadcast and Video Studies Commons, Databases and Information Systems Commons, Numerical Analysis and Scientific Computing Commons