Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

10-2022

Abstract

While transformers have shown great potential on video recognition with their strong capability of capturing long-range dependencies, they often suffer from high computational costs induced by self-attention over the huge number of 3D tokens. In this paper, we present a new transformer architecture termed DualFormer, which can efficiently perform space-time attention for video recognition. Concretely, DualFormer stratifies the full space-time attention into two cascaded levels: it first learns fine-grained local interactions among nearby 3D tokens, and then captures coarse-grained global dependencies between the query token and global pyramid contexts. Unlike existing methods that apply space-time factorization or restrict attention computations to local windows to improve efficiency, our local-global stratification strategy captures both short- and long-range spatiotemporal dependencies while greatly reducing the number of keys and values in attention computation. Experimental results verify the superiority of DualFormer over existing methods on five video benchmarks. In particular, DualFormer achieves 82.9%/85.2% top-1 accuracy on Kinetics-400/600 with ∼1000G inference FLOPs, which is at least 3.2× fewer than existing methods with similar performance. We have released the source code at https://github.com/sail-sg/dualformer.
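
The reduction in keys and values described in the abstract can be illustrated with a rough count. The sketch below is not taken from the paper or its released code: the function names, the token grid, the window size, and the pyramid shapes are all hypothetical, chosen only to show why attending to a local window plus a few pooled global contexts involves far fewer keys per query than full space-time attention.

```python
from math import prod

def full_attention_keys(grid):
    """Keys per query when every token attends to every token in the grid."""
    return prod(grid)

def local_window_keys(window):
    """Keys per query when attention is restricted to one 3D local window."""
    return prod(window)

def pyramid_context_keys(pyramid_levels):
    """Keys per query when attending to pooled global contexts.

    Each level is the (T, H, W) shape of one pooled context grid.
    """
    return sum(prod(level) for level in pyramid_levels)

# Hypothetical sizes for illustration only (assumptions, not from the paper).
grid = (16, 56, 56)                          # token grid: frames x height x width
window = (8, 7, 7)                           # local 3D window
pyramid = [(8, 7, 7), (4, 4, 4), (2, 2, 2)]  # pooled global context grids

full = full_attention_keys(grid)
local_keys = local_window_keys(window)
global_keys = pyramid_context_keys(pyramid)

# Total keys seen by one query across the two cascaded stages.
stratified = local_keys + global_keys

print(f"full attention keys per query: {full}")
print(f"local + global keys per query: {stratified}")
print(f"reduction factor: {full / stratified:.1f}x")
```

With these assumed sizes, a query sees 856 keys across the local and global stages instead of 50176 under full attention. The actual savings depend on the configuration used in the paper, which this record does not specify.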

Keywords

Efficient video transformer, Local and global attention

Discipline

Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces

Research Areas

Intelligent Systems and Optimization

Areas of Excellence

Digital transformation

Publication

Proceedings of the 17th European Conference (ECCV 2022), Tel Aviv, Israel, October 23-27

First Page

577

Last Page

595

ISBN

9783031198298

Identifier

10.1007/978-3-031-19830-4_33

Publisher

Springer

City or Country

Cham

Citation

LIANG, Yuxuan; ZHOU, Pan; ZIMMERMANN, Roger; and YAN, Shuicheng. DualFormer: Local-global stratified transformer for efficient video recognition. (2022). Proceedings of the 17th European Conference (ECCV 2022), Tel Aviv, Israel, October 23-27. 577-595.
Available at: https://ink.library.smu.edu.sg/sis_research/8980

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1007/978-3-031-19830-4_33

Included in

Artificial Intelligence and Robotics Commons, Graphics and Human Computer Interfaces Commons

Research Collection School Of Computing and Information Systems

DualFormer: Local-global stratified transformer for efficient video recognition
