Learning temporal dynamics in videos with image transformer
Publication Type
Journal Article
Publication Date
4-2024
Abstract
Temporal dynamics represent the evolution of video content over time and are critical for action recognition. In this paper, we ask the question: can the off-the-shelf image transformer architecture learn temporal dynamics in videos? To this end, we propose Multidimensional Stacked Image (MSImage), a new arrangement of video data that can be fed to image transformers. Technically, an MSImage is a high-resolution image composed of several evenly sampled video clips stacked along the channel and space dimensions. The frames in each clip are concatenated along the channel dimension so that the transformer can infer short-term dynamics. Meanwhile, the clips are arranged at different spatial positions for learning long-term dynamics. On this basis, we propose MSImageFormer, a new variant of image transformer that takes an MSImage as input and is jointly optimized by a video classification loss and a new dynamics enhancement loss. The network optimization attends to the high-frequency component of the MSImage, avoiding overfitting to static visual patterns. We empirically demonstrate the merits of MSImageFormer on six action recognition benchmarks. With only a 2D image transformer as the classifier, MSImageFormer achieves 85.3% and 69.7% top-1 accuracy on the Kinetics-400 and Something-Something V2 datasets, respectively. Despite requiring less computation, the results are comparable to those of state-of-the-art 3D CNNs and video transformers.
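The stacking described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `build_msimage`, its parameters, the use of consecutive frames within a clip, and the row-major grid layout are all assumptions made for illustration, since the abstract only states that frames are concatenated along the channel dimension and clips are placed at different spatial positions.

```python
import numpy as np


def build_msimage(video, num_clips=4, frames_per_clip=3, grid_cols=2):
    """Arrange a video of shape (T, H, W, C) into one stacked image.

    Hypothetical helper: the sampling scheme and grid layout are
    assumptions, not details taken from the paper.
    """
    t, h, w, c = video.shape
    if num_clips % grid_cols != 0:
        raise ValueError("num_clips must be divisible by grid_cols")
    if t < frames_per_clip:
        raise ValueError("video is shorter than one clip")

    # Evenly spaced start indices for the clips across the video.
    starts = np.linspace(0, t - frames_per_clip, num_clips).astype(int)

    clips = []
    for s in starts:
        frames = video[s:s + frames_per_clip]  # (K, H, W, C)
        # Short-term dynamics: concatenate the K frames along the
        # channel axis, giving an array of shape (H, W, K * C).
        clips.append(np.concatenate(list(frames), axis=-1))

    # Long-term dynamics: tile the clips on a spatial grid, row by row.
    grid_rows = num_clips // grid_cols
    rows = [
        np.concatenate(clips[r * grid_cols:(r + 1) * grid_cols], axis=1)
        for r in range(grid_rows)
    ]
    return np.concatenate(rows, axis=0)  # (rows * H, cols * W, K * C)
```

With the default arguments, a 16-frame RGB video of 8×8 pixels becomes an array of shape (16, 16, 9): a 2×2 spatial grid of clips, each carrying three RGB frames in its channel dimension. An image transformer consuming such an input would need a patch embedding that accepts more than three channels, which is likewise an assumption here rather than a detail from the abstract.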
Keywords
Neural networks, Video action recognition, Vision transformer, Video transformers, Three-dimensional displays, Optical flow, Visualization, Optimization, Image recognition
Discipline
Artificial Intelligence and Robotics
Research Areas
Intelligent Systems and Optimization
Publication
IEEE Transactions on Multimedia
Volume
26
First Page
8915
Last Page
8927
ISSN
1520-9210
Identifier
10.1109/TMM.2024.3383662
Publisher
Institute of Electrical and Electronics Engineers
Citation
SHU, Yan; QIU, Z; LONG, Fuchen; YAO, Ting; NGO, Chong-wah; and MEI, Tao.
Learning temporal dynamics in videos with image transformer. (2024). IEEE Transactions on Multimedia. 26, 8915-8927.
Available at: https://ink.library.smu.edu.sg/sis_research/9860
Additional URL
https://doi.org/10.1109/TMM.2024.3383662