Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
10-2022
Abstract
A video transformer naturally incurs a heavier computation burden than a static vision transformer, as the former processes a T× longer sequence than the latter under the current attention of quadratic complexity (T²N²). Existing works treat the temporal axis as a simple extension of the spatial axes, focusing on shortening the spatio-temporal sequence by either generic pooling or local windowing, without exploiting temporal redundancy. However, videos naturally contain redundant information between neighboring frames; we could therefore suppress attention on visually similar frames in a dilated manner. Based on this hypothesis, we propose LAPS, a long-term "Leap Attention" (LA), short-term "Periodic Shift" (P-Shift) module for video transformers, with (2TN²) complexity. Specifically, the "LA" groups long-term frames into pairs, then refactors each discrete pair via attention. The "P-Shift" exchanges features between temporal neighbors to compensate for the loss of short-term dynamics. By replacing vanilla 2D attention with LAPS, we can adapt a static transformer into a video one with zero extra parameters and negligible computation overhead (∼2.6%). Experiments on the standard Kinetics-400 benchmark demonstrate that our LAPS transformer achieves competitive performance in terms of accuracy, FLOPs, and Params among CNN and transformer SOTAs. We open-source our project at https://github.com/VideoNetworks/LAPS-transformer.
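The complexity claim in the abstract can be checked with a back-of-envelope count. The sketch below is not the authors' code; it only counts attention-score computations under the stated complexities, assuming full spatio-temporal attention costs (TN)² = T²N² while grouping T frames into T/2 discrete pairs of 2N tokens each costs (T/2)·(2N)² = 2TN².

```python
def full_attention_cost(T: int, N: int) -> int:
    """Attention-score count when every one of the T*N tokens attends to all T*N tokens."""
    return (T * N) ** 2  # = T^2 * N^2


def leap_attention_cost(T: int, N: int) -> int:
    """Attention-score count when frames are grouped into T/2 discrete pairs,
    with attention restricted to the 2N tokens inside each pair."""
    assert T % 2 == 0, "pairing assumes an even number of frames"
    return (T // 2) * (2 * N) ** 2  # = 2 * T * N^2


# Illustrative values (assumed, not from the paper's experiments):
# T = 8 frames, N = 14 * 14 = 196 patches per frame.
T, N = 8, 196
print(full_attention_cost(T, N) // leap_attention_cost(T, N))  # → 4, i.e. a T/2 reduction
```

The ratio T²N² / 2TN² = T/2 shows why the saving grows with the number of frames; the short-term P-Shift described in the abstract adds no attention scores at all, which is consistent with the "zero extra parameters" claim.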
Keywords
Video classification, Transformer, Shift, Leap attention
Discipline
Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces
Research Areas
Intelligent Systems and Optimization
Publication
Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 2022 October 10-14
First Page
5773
Last Page
5782
ISBN
9781450392037
Identifier
10.1145/3503161.3547908
Publisher
ACM
City or Country
Lisbon, Portugal
Citation
ZHANG, Hao; CHENG, Lechao; HAO, Yanbin; and NGO, Chong-wah.
Long-term leap attention, short-term periodic shift for video classification. (2022). Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 2022 October 10-14. 5773-5782.
Available at: https://ink.library.smu.edu.sg/sis_research/7507
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
http://doi.org/10.1145/3503161.3547908
Included in
Artificial Intelligence and Robotics Commons, Graphics and Human Computer Interfaces Commons