Research Collection School Of Computing and Information Systems

PosMLP-Video: Spatial and temporal relative position encoding for efficient video recognition

Publication Type

Journal Article

Version

publishedVersion

Publication Date

6-2024

Abstract

In recent years, vision Transformers and MLPs have demonstrated remarkable performance in image understanding tasks. However, their inherently dense computational operators, such as self-attention and token-mixing layers, pose significant challenges when applied to spatio-temporal video data. To address this challenge, we propose PosMLP-Video, a lightweight yet powerful MLP-like backbone for video recognition. Instead of dense operators, we use efficient relative positional encoding (RPE) to build pairwise token relations, leveraging small-sized parameterized relative position biases to obtain each relation score. Specifically, to enable spatio-temporal modeling, we extend the image PosMLP’s positional gating unit to temporal, spatial, and spatio-temporal variants, namely PoTGU, PoSGU, and PoSTGU, respectively. These gating units can be combined into three types of spatio-temporal factorized positional MLP blocks, which not only decrease model complexity but also maintain good performance. Additionally, we improve the locality of modeling using window partitioning and enrich relative positional relationships using channel grouping. Experimental results demonstrate that PosMLP-Video achieves competitive speed-accuracy trade-offs compared with previous state-of-the-art models. In particular, PosMLP-Video pre-trained on ImageNet-1K achieves 59.0%/70.3% top-1 accuracy on Something-Something V1/V2 and 82.1% top-1 accuracy on Kinetics-400 while requiring far fewer parameters and FLOPs than other models. The code will be made publicly available.
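
The idea of obtaining pairwise token relations from a small table of relative position biases can be illustrated with a minimal NumPy sketch of a temporal gating step. This is not the authors' implementation: the function names `relative_index` and `temporal_positional_gate`, the channel split into two halves, and the absence of normalization, window partitioning, and channel grouping are all assumptions made for illustration.

```python
import numpy as np


def relative_index(n):
    """Return an (n, n) matrix whose entry (i, j) maps the offset j - i
    to an index into a bias table of length 2n - 1."""
    pos = np.arange(n)
    return pos[None, :] - pos[:, None] + (n - 1)


def temporal_positional_gate(x, bias_table):
    """Hypothetical temporal gating step (illustrative only).

    x:          (T, C) tokens along the time axis, C even.
    bias_table: (2T - 1,) relative position biases, one per temporal offset.
    """
    T, C = x.shape
    assert C % 2 == 0, "channel count must be even for the split"
    assert bias_table.shape == (2 * T - 1,)
    # Assumed design: split channels into a gate half (u) and a value half (v).
    u, v = x[:, : C // 2], x[:, C // 2:]
    # Each relation score depends only on the offset between two frames,
    # so the (T, T) relation matrix is a lookup into the small table.
    relation = bias_table[relative_index(T)]
    # Mix the value half across time, then gate it element-wise.
    return u * (relation @ v)


T, C = 8, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((T, C))
bias_table = rng.standard_normal(2 * T - 1)
out = temporal_positional_gate(x, bias_table)
print(out.shape)           # (8, 8)
print(2 * T - 1, T * T)    # bias table size vs. a dense T x T mixing matrix
```

In this sketch the relation matrix needs only 2T - 1 parameters, whereas a dense T x T token-mixing matrix would need T² of them, which is the kind of saving the abstract attributes to small parameterized relative position biases.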

Keywords

Multi-layer perceptron, Positional encoding, Spatio-temporal modeling, Video recognition

Discipline

Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces

Research Areas

Intelligent Systems and Optimization

Publication

International Journal of Computer Vision

Volume

132

Issue

12

First Page

5820

Last Page

5840

ISSN

0920-5691

Identifier

10.21203/rs.3.rs-3485088/v1

Publisher

Springer

City or Country

Cham

Citation

HAO, Yanbin; ZHOU, Diansong; WANG, Zhicai; NGO, Chong-wah; HE, Xiangnan; and WANG, Meng. PosMLP-Video: Spatial and temporal relative position encoding for efficient video recognition. (2024). International Journal of Computer Vision. 132, (12), 5820-5840.
Available at: https://ink.library.smu.edu.sg/sis_research/8256

Copyright Owner and License

Authors CC-BY

Creative Commons License

This work is licensed under a Creative Commons Attribution 3.0 License.

Additional URL

https://doi.org/10.21203/rs.3.rs-3485088/v1

Included in

Artificial Intelligence and Robotics Commons, Graphics and Human Computer Interfaces Commons
