Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

10-2022

Abstract

A video transformer naturally incurs a heavier computation burden than a static vision transformer, since the former processes a T times longer sequence than the latter under the current attention of quadratic complexity (T²N²). Existing works treat the temporal axis as a simple extension of the spatial axes, focusing on shortening the spatio-temporal sequence by either generic pooling or local windowing, without exploiting temporal redundancy. However, videos naturally contain redundant information between neighboring frames; we could therefore suppress attention on visually similar frames in a dilated manner. Based on this hypothesis, we propose LAPS, a long-term "Leap Attention" (LA) and short-term "Periodic Shift" (P-Shift) module for video transformers, with 2TN² complexity. Specifically, LA groups long-term frames into pairs, then refactors each discrete pair via attention. P-Shift exchanges features between temporal neighbors to counter the loss of short-term dynamics. By replacing vanilla 2D attention with LAPS, we can adapt a static transformer into a video one with zero extra parameters and negligible computation overhead (~2.6%). Experiments on the standard Kinetics-400 benchmark demonstrate that our LAPS transformer achieves competitive performance in terms of accuracy, FLOPs, and parameters among CNN and transformer SOTAs. We open-source our project at https://github.com/VideoNetworks/LAPS-transformer.
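The frame pairing and neighbor shift described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names and the `fold_div` parameter are hypothetical, the shift follows the general channel-shift idea (a fraction of channels swapped with temporal neighbors), and the pairing leaps half the clip length. Note the consistency with the stated cost: attending within a 2-frame pair costs (2N)² = 4N², and with T/2 pairs that totals 2TN², versus T²N² for full spatio-temporal attention.

```python
import numpy as np

def leap_pairs(T):
    """Group T frames into long-term pairs: frame t leaps to frame t + T//2.

    Attention is then computed within each 2-frame pair (sequence length 2N),
    giving T/2 * (2N)^2 = 2TN^2 total cost instead of (TN)^2.
    """
    half = T // 2
    return [(t, t + half) for t in range(half)]

def periodic_shift(x, fold_div=8):
    """Exchange a fraction of channels with temporal neighbors.

    x: array of shape (T, N, C) -- T frames, N tokens per frame, C channels.
    One channel fold is pulled from the next frame, another from the
    previous frame; remaining channels are left untouched.
    """
    T, N, C = x.shape
    fold = C // fold_div
    out = x.copy()
    out[:-1, :, :fold] = x[1:, :, :fold]                # from next frame
    out[1:, :, fold:2 * fold] = x[:-1, :, fold:2 * fold]  # from previous frame
    return out
```

For an 8-frame clip, `leap_pairs(8)` yields the dilated pairs (0, 4), (1, 5), (2, 6), (3, 7); `periodic_shift` preserves the input shape and adds no learnable parameters, matching the zero-extra-parameter property claimed in the abstract.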

Keywords

Video classification, Transformer, Shift, Leap attention

Discipline

Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces

Research Areas

Intelligent Systems and Optimization

Publication

Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 2022 October 10-14

First Page

5773

Last Page

5782

ISBN

9781450392037

Identifier

10.1145/3503161.3547908

Publisher

ACM

City or Country

Lisbon, Portugal

Additional URL

https://doi.org/10.1145/3503161.3547908
