Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
10-2022
Abstract
The few-shot learning ability of vision transformers (ViTs) is rarely investigated, though heavily desired. In this work, we empirically find that with the same few-shot learning frameworks, e.g., MetaBaseline, replacing the widely used CNN feature extractor with a ViT model often severely impairs few-shot classification performance. Moreover, our empirical study shows that in the absence of inductive bias, ViTs often learn low-quality token dependencies under the few-shot learning regime, where only a few labeled training samples are available, which largely contributes to the above performance degradation. To alleviate this issue, for the first time, we propose a simple yet effective few-shot training framework for ViTs, namely Self-promoted sUpervisioN (SUN). Specifically, besides the conventional global supervision for global semantic learning, SUN further pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token. This location-specific supervision tells the ViT which patch tokens are similar or dissimilar, and thus accelerates token dependency learning. Moreover, it models the local semantics in each patch token, improving object grounding and recognition capability, which helps learn generalizable patterns. To improve the quality of the location-specific supervision, we further propose two techniques: 1) background patch filtration, which filters out background patches and assigns them to an extra background class; and 2) spatial-consistent augmentation, which introduces sufficient diversity for data augmentation while preserving the accuracy of the generated local supervision. Experimental results show that SUN with ViTs significantly surpasses other few-shot learning frameworks using ViTs and is the first to achieve higher performance than state-of-the-art CNN counterparts. Our code is publicly available at https://github.com/DongSky/few-shot-vit
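For concreteness, below is a minimal sketch (not the authors' released code) of how a pretrained teacher ViT could produce the location-specific supervision with background patch filtration described above; the patch-level classifier head, the confidence threshold, and all names here are illustrative assumptions:

import torch
import torch.nn.functional as F

def location_specific_targets(patch_feats, patch_head, num_classes, bg_threshold=0.5):
    # patch_feats: (B, N, D) patch-token features from the pretrained (teacher) ViT.
    # patch_head: hypothetical linear head mapping (B, N, D) -> (B, N, num_classes) logits.
    logits = patch_head(patch_feats)
    probs = F.softmax(logits, dim=-1)
    conf, labels = probs.max(dim=-1)  # per-patch confidence and pseudo-label
    # Background patch filtration: low-confidence patches are assigned
    # to an extra background class with index num_classes.
    bg = torch.full_like(labels, num_classes)
    return torch.where(conf >= bg_threshold, labels, bg)  # (B, N) per-patch targets

These per-patch targets would then supervise the student ViT's patch tokens alongside the usual global class label; spatial-consistent augmentation would apply the same spatial transform (e.g., crop and flip) to both the image and the target map so the local supervision stays aligned with the patches.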
Keywords
few-shot learning, location-specific supervision
Discipline
Graphics and Human Computer Interfaces
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
Proceedings of the 17th European Conference on Computer Vision (ECCV 2022), Tel Aviv, Israel, October 23-27
First Page
329
Last Page
347
ISBN
9783031200434
Identifier
10.1007/978-3-031-20044-1_19
Publisher
Springer
City or Country
Cham
Citation
DONG, Bowen; ZHOU, Pan; YAN, Shuicheng; and ZUO, Wangmeng.
Self-promoted supervision for few-shot transformer. (2022). Proceedings of the 17th European Conference on Computer Vision (ECCV 2022), Tel Aviv, Israel, October 23-27. 329-347.
Available at: https://ink.library.smu.edu.sg/sis_research/8984
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1007/978-3-031-20044-1_19