Stream-ViT: Learning streamlined convolutions in Vision Transformer

Publication Type

Journal Article

Publication Date

January 2025

Abstract

Recently, hybrid deep architectures that combine the Vision Transformer (ViT) and the Convolutional Neural Network (CNN) have emerged, offering a better trade-off among model capacity, generalization, and latency. Most of these hybrid architectures directly stack a self-attention module with a static convolution, or fuse their outputs through two pathways within each block. Instead, we present a new Transformer architecture (namely Stream-ViT) that integrates ViT with streamlined convolutions, i.e., a series of high-to-low resolution convolutions. The kernels of each convolution are dynamically learnt from the current input features plus pre-learnt kernels throughout the whole network. The new architecture incorporates a critical pathway that streamlines kernel generation and triggers interactions between the dynamically learnt convolutions across different layers. Moreover, each layer-wise streamlined convolution is functionally equivalent to a squeezed version of a multi-branch convolution structure, thereby improving the capacity of the self-attention module with enlarged cardinality in a cost-efficient manner. We validate the superiority of Stream-ViT on multiple vision tasks, where it surpasses state-of-the-art ViT and CNN backbones with comparable FLOPs.
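
Below is a minimal sketch of the dynamic kernel-generation idea the abstract describes: convolution kernels predicted from the current input features are fused with pre-learnt (static) kernels before being applied. This is an assumption-laden illustration in a PyTorch depthwise style, not the authors' implementation; all names (`DynamicStreamConv`, `kernel_head`, the additive fusion) are hypothetical.

```python
# Illustrative sketch only: dynamic kernels conditioned on input features,
# combined with a pre-learnt static kernel bank. Design details are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicStreamConv(nn.Module):
    """Hypothetical layer mixing pre-learnt and input-conditioned kernels."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        # Pre-learnt (static) depthwise kernel bank.
        self.static_kernel = nn.Parameter(
            torch.randn(channels, 1, kernel_size, kernel_size) * 0.02)
        # Small head predicting per-channel dynamic kernels from
        # globally pooled input features.
        self.kernel_head = nn.Linear(channels, channels * kernel_size ** 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Summarize the current input and predict dynamic kernels from it.
        ctx = x.mean(dim=(2, 3))                             # (B, C)
        dyn = self.kernel_head(ctx).view(b, c, 1, self.k, self.k)
        # Fuse dynamic and pre-learnt kernels (simple additive fusion here).
        kernels = dyn + self.static_kernel                   # (B, C, 1, k, k)
        # Apply per-sample kernels via a grouped depthwise convolution.
        x = x.reshape(1, b * c, h, w)
        out = F.conv2d(x, kernels.reshape(b * c, 1, self.k, self.k),
                       padding=self.k // 2, groups=b * c)
        return out.reshape(b, c, h, w)

# Usage: feats = DynamicStreamConv(64)(torch.randn(2, 64, 56, 56))
```

The grouped-convolution trick (folding the batch into the channel axis) is one common way to apply a different kernel per sample; the paper's actual streamlined high-to-low resolution series and cross-layer kernel pathway are not reproduced here.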

Discipline

Artificial Intelligence and Robotics | Software Engineering

Research Areas

Intelligent Systems and Optimization

Publication

IEEE Transactions on Multimedia

Volume

27

First Page

3755

Last Page

3765

ISSN

1520-9210

Identifier

10.1109/TMM.2025.3535321

Publisher

Institute of Electrical and Electronics Engineers

Additional URL

https://doi.org/10.1109/TMM.2025.3535321
