Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
7-2023
Abstract
Neural networks for visual content understanding have recently evolved from convolutional ones to transformers. The former (CNN) relies on small-windowed kernels to capture regional clues, demonstrating solid local expressiveness. In contrast, the latter (transformer) establishes long-range global connections between localities for holistic learning. Inspired by this complementary nature, there is growing interest in designing hybrid models that utilize both techniques. Current hybrids either use convolutions merely as simple approximations of linear projections or juxtapose a convolution branch with attention, without considering the relative importance of local and global modeling. To tackle this, we propose a new hybrid named Adaptive Split-Fusion Transformer (ASF-former) that treats convolutional and attention branches differently with adaptive weights. Specifically, an ASF-former encoder splits the feature channels into two equal halves to feed the dual-path inputs. Then, the outputs of the dual path are fused with weights calculated from visual cues. We also design a compact convolutional path for efficiency. Extensive experiments on standard benchmarks show that our ASF-former outperforms its CNN, transformer, and hybrid counterparts in terms of accuracy (83.9% on ImageNet-1K) under similar conditions (12.9G MACs / 56.7M Params, without large-scale pre-training). The code is available at: https://github.com/szx503045266/ASF-former.
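The split-and-fuse idea in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation (see the repository linked above for that). It assumes a windowed token average as a stand-in for the convolutional path, projection-free softmax self-attention as a stand-in for the attention path, and a single scalar gate computed from globally pooled features. All function names, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math

def split_channels(tokens):
    """Split each token's channels into two equal halves."""
    half = len(tokens[0]) // 2
    return [t[:half] for t in tokens], [t[half:] for t in tokens]

def local_branch(tokens, radius=1):
    """Stand-in for the convolutional path: average over a small token window."""
    n = len(tokens)
    out = []
    for i in range(n):
        window = tokens[max(0, i - radius): min(n, i + radius + 1)]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

def global_branch(tokens):
    """Stand-in for the attention path: softmax self-attention, no learned projections."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) / total
                    for c in range(dim)])
    return out

def gate(tokens, gate_weights, gate_bias=0.0):
    """Scalar fusion weight in (0, 1) from globally pooled features (assumed form)."""
    n = len(tokens)
    pooled = [sum(col) / n for col in zip(*tokens)]
    logit = sum(w * p for w, p in zip(gate_weights, pooled)) + gate_bias
    return 1.0 / (1.0 + math.exp(-logit))

def split_fusion(tokens, gate_weights, gate_bias=0.0):
    """Split channels, run both paths, and fuse them with an input-dependent weight."""
    conv_in, attn_in = split_channels(tokens)
    conv_out = local_branch(conv_in)
    attn_out = global_branch(attn_in)
    alpha = gate(tokens, gate_weights, gate_bias)
    fused = [[alpha * c + (1.0 - alpha) * a for c, a in zip(c_row, a_row)]
             for c_row, a_row in zip(conv_out, attn_out)]
    return fused, alpha

# Three tokens with four channels each; zero gate weights give an even blend.
tokens = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, 0.5], [1.0, 1.0, 0.0, 1.0]]
fused, alpha = split_fusion(tokens, gate_weights=[0.0, 0.0, 0.0, 0.0])
```

In this sketch the fused output has half the input channel count. Any projection back to the full width, as well as normalization, residual connections, and learned parameters, is left out for brevity.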
Keywords
CNN; Gating; Hybrid; Transformer; Visual understanding
Discipline
Databases and Information Systems
Research Areas
Data Science and Engineering
Publication
Proceedings of the 2023 IEEE International Conference on Multimedia and Expo, Brisbane, Australia, July 10-14
First Page
1169
Last Page
1174
ISBN
9781665468916
Identifier
10.1109/ICME55011.2023.00204
Publisher
IEEE
City or Country
New Jersey
Citation
SU, Zixuan; CHEN, Jingjing; PANG, Lei; NGO, Chong-wah; and JIANG, Yu-Gang.
Adaptive split-fusion transformer. (2023). Proceedings of the 2023 IEEE International Conference on Multimedia and Expo, Brisbane, Australia, July 10-14. 1169-1174.
Available at: https://ink.library.smu.edu.sg/sis_research/8263
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1109/ICME55011.2023.00204