Stream-ViT: Learning streamlined convolutions in Vision Transformer
Publication Type
Journal Article
Publication Date
1-2025
Abstract
Vision Transformers (ViT) and Convolutional Neural Networks (CNN) have recently been combined into hybrid deep architectures that offer a better trade-off among model capacity, generalization, and latency. Most of these hybrid architectures either directly stack self-attention modules with static convolutions or fuse their outputs through two pathways within each block. Instead, we present a new Transformer architecture, Stream-ViT, that integrates ViT with streamlined convolutions, i.e., a series of high-to-low resolution convolutions. The kernels of each convolution are dynamically learnt on the basis of the current input features plus pre-learnt kernels throughout the whole network. The new architecture incorporates a critical pathway to streamline kernel generation, which triggers interactions between dynamically learnt convolutions across different layers. Moreover, introducing a layer-wise streamlined convolution is functionally equivalent to a squeezed version of a multi-branch convolution structure, thereby improving the capacity of the self-attention module with enlarged cardinality in a cost-efficient manner. We validate Stream-ViT on multiple vision tasks, where it surpasses state-of-the-art ViT and CNN backbones with comparable FLOPs.
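The following is an illustrative sketch only, not the authors' implementation: the abstract does not specify how kernels are generated or how the cross-layer pathway is wired. It assumes (hypothetically) that each layer mixes a small bank of pre-learnt basis kernels with softmax weights computed from globally pooled input features, that the kernel from the previous layer is added to the next one to mimic the kernel-generation pathway, and that each convolution is depthwise with stride 2 to give the high-to-low resolution series. All function and parameter names are invented for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def dynamic_kernel(features, basis_kernels, proj, prev_kernel=None):
    """Build an input-dependent depthwise kernel (assumed scheme).

    features:      (C, H, W) input feature map
    basis_kernels: (K, C, k, k) pre-learnt kernel bank
    proj:          (K, C) projection from pooled features to mixing logits
    prev_kernel:   (C, k, k) kernel from the previous layer, or None
    """
    pooled = features.mean(axis=(1, 2))                     # (C,)
    weights = softmax(proj @ pooled)                        # (K,), sums to 1
    kernel = np.tensordot(weights, basis_kernels, axes=1)   # (C, k, k)
    if prev_kernel is not None:
        # Assumed form of the cross-layer pathway: additive carry-over.
        kernel = kernel + prev_kernel
    return kernel


def depthwise_conv(features, kernel, stride=1):
    """Naive depthwise convolution with zero padding of k // 2."""
    C, H, W = features.shape
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(features, ((0, 0), (pad, pad), (pad, pad)))
    out_h = (H + 2 * pad - k) // stride + 1
    out_w = (W + 2 * pad - k) // stride + 1
    out = np.zeros((C, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = padded[:, r:r + k, c:c + k]
            out[:, i, j] = (patch * kernel).sum(axis=(1, 2))
    return out


def streamlined_convs(features, basis_per_layer, proj_per_layer):
    """Apply a chain of dynamic convolutions, halving resolution each step."""
    kernel = None
    outputs = []
    for basis, proj in zip(basis_per_layer, proj_per_layer):
        kernel = dynamic_kernel(features, basis, proj, prev_kernel=kernel)
        features = depthwise_conv(features, kernel, stride=2)
        outputs.append(features)
    return outputs


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16, 16))
bases = [rng.normal(size=(3, 4, 3, 3)) for _ in range(2)]
projs = [rng.normal(size=(3, 4)) for _ in range(2)]
outs = streamlined_convs(x, bases, projs)
```

In this sketch each layer's kernel depends on both the current features and the kernel produced one layer earlier, which is one simple way to let dynamically generated convolutions interact across layers; the published model may realize this quite differently.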
Discipline
Artificial Intelligence and Robotics | Software Engineering
Research Areas
Intelligent Systems and Optimization
Publication
IEEE Transactions on Multimedia
Volume
27
First Page
3755
Last Page
3765
ISSN
1520-9210
Identifier
10.1109/TMM.2025.3535321
Publisher
Institute of Electrical and Electronics Engineers
Citation
PAN, Yingwei; LI, Yehao; YAO, Ting; NGO, Chong-wah; and MEI, Tao.
Stream-ViT: Learning streamlined convolutions in Vision Transformer. (2025). IEEE Transactions on Multimedia. 27, 3755-3765.
Available at: https://ink.library.smu.edu.sg/sis_research/10814
Additional URL
https://doi.org/10.1109/TMM.2025.3535321