Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
10-2022
Abstract
Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks, while the self-attention computation in Transformer scales quadratically w.r.t. the input patch number. Thus, existing solutions commonly employ down-sampling operations (e.g., average pooling) over keys/values to dramatically reduce the computational cost. In this work, we argue that such an over-aggressive down-sampling design is not invertible and inevitably causes information loss, especially for high-frequency components in objects (e.g., texture details). Motivated by wavelet theory, we construct a new Wavelet Vision Transformer (Wave-ViT) that formulates invertible down-sampling with wavelet transforms and self-attention learning in a unified way. This proposal enables self-attention learning with lossless down-sampling over keys/values, facilitating the pursuit of a better efficiency-vs-accuracy trade-off. Furthermore, inverse wavelet transforms are leveraged to strengthen self-attention outputs by aggregating local contexts with an enlarged receptive field. We validate the superiority of Wave-ViT through extensive experiments over multiple vision tasks (e.g., image recognition, object detection and instance segmentation). Its performance surpasses state-of-the-art ViT backbones with comparable FLOPs. Source code is available at https://github.com/YehLi/ImageNetModel.
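The abstract's central claim — that wavelet down-sampling is invertible while average pooling is not — can be illustrated with a minimal sketch. The following is a hypothetical 1-D Haar wavelet transform (not the authors' code, which operates on 2-D feature maps of keys/values): the forward transform halves the signal length, yet the inverse recovers it exactly because the high-frequency half is kept rather than discarded.

```python
# Hypothetical sketch: 1-D Haar wavelet transform, illustrating why
# wavelet down-sampling is lossless, unlike average pooling, which
# keeps only the low-frequency (average) half.

def haar_forward(x):
    """Split x into low-frequency (pairwise averages) and
    high-frequency (pairwise half-differences) halves."""
    low = [(a + b) / 2 for a, b in zip(x[0::2], x[1::2])]
    high = [(a - b) / 2 for a, b in zip(x[0::2], x[1::2])]
    return low, high

def haar_inverse(low, high):
    """Reconstruct the original signal exactly from both halves."""
    x = []
    for l, h in zip(low, high):
        x.extend([l + h, l - h])
    return x

signal = [4.0, 2.0, 5.0, 7.0]
low, high = haar_forward(signal)
assert haar_inverse(low, high) == signal  # lossless round-trip
# Average pooling would keep only `low` ([3.0, 6.0]) and could not
# recover `signal`: the high-frequency detail is gone.
```

In Wave-ViT, the analogous 2-D transform (DWT/IDWT) plays the role of the pooling step over keys/values, which is what makes the reduced-resolution self-attention lossless.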
Keywords
Vision transformer, Wavelet transform, Self-attention learning, Image recognition
Discipline
Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces
Research Areas
Intelligent Systems and Optimization
Publication
Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022: Proceedings
Volume
13685
First Page
328
Last Page
345
ISBN
9783031198069
Identifier
10.1007/978-3-031-19806-9_19
Publisher
Springer
City or Country
Cham
Citation
YAO, Ting; PAN, Yingwei; LI, Yehao; NGO, Chong-wah; and MEI, Tao.
Wave-ViT: Unifying wavelet and transformers for visual representation learning. (2022). Computer Vision ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27: Proceedings. 13685, 328-345.
Available at: https://ink.library.smu.edu.sg/sis_research/7508
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1007/978-3-031-19806-9_19
Included in
Artificial Intelligence and Robotics Commons, Graphics and Human Computer Interfaces Commons