Research Collection School Of Computing and Information Systems

Learning temporal dynamics in videos with image transformer

Publication Type

Journal Article

Publication Date

4-2024

Abstract

Temporal dynamics represent the evolution of video content over time and are critical for action recognition. In this paper, we ask the question: can the off-the-shelf image transformer architecture learn temporal dynamics in videos? To this end, we propose the Multidimensional Stacked Image (MSImage), a new arrangement of video data that can be fed to image transformers. Technically, an MSImage is a high-resolution image composed of several evenly sampled video clips stacked along the channel and space dimensions. The frames in each clip are concatenated along the channel dimension so that the transformer can infer short-term dynamics, while the clips are arranged at different spatial positions for learning long-term dynamics. On this basis, we propose MSImageFormer, a new variant of image transformer that takes an MSImage as input and is jointly optimized by a video classification loss and a new dynamics enhancement loss. The network optimization attends to the high-frequency component of the MSImage, avoiding overfitting to static visual patterns. We empirically demonstrate the merits of MSImageFormer on six action recognition benchmarks. With only a 2D image transformer as the classifier, MSImageFormer achieves 85.3% and 69.7% top-1 accuracy on the Kinetics-400 and Something-Something V2 datasets, respectively. Despite requiring less computation, these results are comparable to those of state-of-the-art (SOTA) 3D CNNs and video transformers.
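
The stacking described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_msimage`, the clip count, the frames per clip, the grid layout, and the channel ordering are all assumptions chosen for illustration, since the abstract does not specify them. The sketch samples clips evenly across the video, concatenates the frames of each clip along the channel axis, and tiles the clips on a spatial grid.

```python
import numpy as np


def build_msimage(video, num_clips, frames_per_clip, grid):
    """Arrange a video as one large multi-channel image (illustrative only).

    video: array of shape (T, H, W, C).
    grid:  (rows, cols) with rows * cols == num_clips (assumed layout).
    Returns an array of shape (rows * H, cols * W, frames_per_clip * C).
    """
    t, h, w, c = video.shape
    rows, cols = grid
    if rows * cols != num_clips:
        raise ValueError("grid must hold exactly num_clips tiles")
    if frames_per_clip > t:
        raise ValueError("video is shorter than one clip")

    # Evenly spaced clip start indices over the whole video.
    starts = np.linspace(0, t - frames_per_clip, num_clips).round().astype(int)

    tiles = []
    for s in starts:
        clip = video[s:s + frames_per_clip]          # (F, H, W, C)
        # Move the frame axis next to the channel axis, then merge them,
        # so each frame occupies C consecutive channels of the tile.
        tile = clip.transpose(1, 2, 0, 3).reshape(h, w, frames_per_clip * c)
        tiles.append(tile)

    # Place the clips on the spatial grid in row-major (temporal) order.
    grid_rows = [
        np.concatenate(tiles[r * cols:(r + 1) * cols], axis=1)
        for r in range(rows)
    ]
    return np.concatenate(grid_rows, axis=0)


# Toy example: 16 frames of 4x4 RGB, four clips of four frames, 2x2 grid.
video = np.arange(16 * 4 * 4 * 3).reshape(16, 4, 4, 3)
msimage = build_msimage(video, num_clips=4, frames_per_clip=4, grid=(2, 2))
```

In this layout, neighbouring channels within a tile carry short-term motion, while different tiles of the same image carry temporally distant content, so a standard 2D transformer whose input layer accepts `frames_per_clip * C` channels could in principle attend across both.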

Keywords

Neural networks, Video action recognition, Vision transformer, Video transformers, Three-dimensional displays, Optical flow, Visualization, Optimization, Image recognition

Discipline

Artificial Intelligence and Robotics

Research Areas

Intelligent Systems and Optimization

Publication

IEEE Transactions on Multimedia

Volume

26

First Page

8915

Last Page

8927

ISSN

1520-9210

Identifier

10.1109/TMM.2024.3383662

Publisher

Institute of Electrical and Electronics Engineers

Citation

SHU, Yan; QIU, Z; LONG, Fuchen; YAO, Ting; NGO, Chong-wah; and MEI, Tao. Learning temporal dynamics in videos with image transformer. (2024). IEEE Transactions on Multimedia. 26, 8915-8927.
Available at: https://ink.library.smu.edu.sg/sis_research/9860

Additional URL

https://doi.org/10.1109/TMM.2024.3383662

