Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
10-2022
Abstract
Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks, while the self-attention computation in Transformer scales quadratically w.r.t. the input patch number. Thus, existing solutions commonly employ down-sampling operations (e.g., average pooling) over keys/values to dramatically reduce the computational cost. In this work, we argue that such an over-aggressive down-sampling design is not invertible and inevitably causes information loss, especially for high-frequency components in objects (e.g., texture details). Motivated by wavelet theory, we construct a new Wavelet Vision Transformer (Wave-ViT) that formulates invertible down-sampling with wavelet transforms and self-attention learning in a unified way. This proposal enables self-attention learning with lossless down-sampling over keys/values, facilitating the pursuit of a better efficiency-vs-accuracy trade-off. Furthermore, inverse wavelet transforms are leveraged to strengthen self-attention outputs by aggregating local contexts with an enlarged receptive field. We validate the superiority of Wave-ViT through extensive experiments over multiple vision tasks (e.g., image recognition, object detection and instance segmentation). Its performance surpasses state-of-the-art ViT backbones with comparable FLOPs. Source code is available at https://github.com/YehLi/ImageNetModel.
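The abstract's central claim — that wavelet down-sampling is invertible while average pooling is not — can be illustrated with a minimal sketch. The following is a hypothetical 1-D Haar wavelet transform (not the authors' code, which operates on 2-D feature maps of keys/values): the forward transform halves the signal length, yet the inverse recovers it exactly because the high-frequency half is kept rather than discarded.

```python
# Hypothetical sketch: 1-D Haar wavelet transform, illustrating why
# wavelet down-sampling is lossless, unlike average pooling, which
# keeps only the low-frequency (average) half.

def haar_forward(x):
    """Split x into low-frequency (pairwise averages) and
    high-frequency (pairwise half-differences) halves."""
    low = [(a + b) / 2 for a, b in zip(x[0::2], x[1::2])]
    high = [(a - b) / 2 for a, b in zip(x[0::2], x[1::2])]
    return low, high

def haar_inverse(low, high):
    """Reconstruct the original signal exactly from both halves."""
    x = []
    for l, h in zip(low, high):
        x.extend([l + h, l - h])
    return x

signal = [4.0, 2.0, 5.0, 7.0]
low, high = haar_forward(signal)
assert haar_inverse(low, high) == signal  # lossless round-trip
# Average pooling would keep only `low` ([3.0, 6.0]) and could not
# recover `signal`: the high-frequency detail is gone.
```

In Wave-ViT, the analogous 2-D transform (DWT/IDWT) plays the role of the pooling step over keys/values, which is what makes the reduced-resolution self-attention lossless.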
Keywords
Vision transformer, Wavelet transform, Self-attention learning, Image recognition
Discipline
Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces
Research Areas
Intelligent Systems and Optimization
Publication
Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022: Proceedings
Volume
13685
First Page
328
Last Page
345
ISBN
9783031198069
Identifier
10.1007/978-3-031-19806-9_19
Publisher
Springer
City or Country
Cham
Citation
YAO, Ting; PAN, Yingwei; LI, Yehao; NGO, Chong-wah; and MEI, Tao.
Wave-ViT: Unifying wavelet and transformers for visual representation learning. (2022). Computer Vision ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27: Proceedings. 13685, 328-345.
Available at: https://ink.library.smu.edu.sg/sis_research/7508
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1007/978-3-031-19806-9_19
Included in
Artificial Intelligence and Robotics Commons, Graphics and Human Computer Interfaces Commons