Publication Type
Journal Article
Version
acceptedVersion
Publication Date
3-2026
Abstract
Accurate correspondence extraction between distinctive pixel-wise and point-wise features is critical for image-to-point cloud (I2P) registration. Recent efforts leveraging Transformers for I2P feature representation have demonstrated potential, primarily by first capturing intra-modality global contextual dependencies via self-attention, and then learning cross-modality correlations via cross-attention. While vanilla Transformers excel at modeling cross-modality global feature correlations, such mechanisms often struggle with the structural disparity between dense image pixels and sparse 3D points, hindering the establishment of fine-grained correspondences. Moreover, global attention may introduce ambiguity, as interactions with many inconsistent intra-modality regions may degrade feature distinctiveness. To address these limitations, we propose CylindFormer, a novel Cylindrical Transformer model designed to establish accurate and reliable correspondences for efficient and robust I2P registration. Specifically, to overcome the inherent structural discrepancy, our method leverages cylindrical projection of 3D points onto the image plane to define spatially aware clusters, enabling local feature aggregation of image pixels. These fused features are then aligned with 3D point features through adaptive attention to strengthen cross-modality correlations. In addition, CylindFormer introduces a cylindrical self-attention mechanism to explicitly learn intra-modality global structural consistency, effectively mitigating feature ambiguity. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate the efficacy of CylindFormer. On the challenging RGB-D Scenes V2 dataset, our method improves the inlier ratio by 10.1∼16.3 percentage points and the registration recall by 1.3∼14.6 percentage points, while achieving over 13× faster pose estimation and reducing model parameters to less than one-ninth of those of the state-of-the-art method.
The source code will be released at https://github.com/jtw220/CylindFormer soon.
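For context, the cylindrical projection step mentioned in the abstract maps 3D points to 2D grid coordinates by azimuth and elevation. The sketch below is a generic illustration of such a projection, not the authors' implementation; the grid resolution and vertical field-of-view values are illustrative assumptions.

```python
import numpy as np

def cylindrical_project(points, width=512, height=64,
                        fov_up=np.deg2rad(3.0), fov_down=np.deg2rad(-25.0)):
    """Map an (N, 3) array of 3D points to integer (u, v) cylindrical
    grid coordinates: u indexes azimuth, v indexes elevation.
    Grid size and field of view are illustrative, not from the paper."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)          # horizontal distance to the axis
    theta = np.arctan2(y, x)                # azimuth angle in (-pi, pi]
    phi = np.arctan2(z, rho)                # elevation angle
    u = ((theta + np.pi) / (2.0 * np.pi)) * width              # column
    v = (1.0 - (phi - fov_down) / (fov_up - fov_down)) * height  # row
    u = np.clip(u.astype(np.int64), 0, width - 1)
    v = np.clip(v.astype(np.int64), 0, height - 1)
    return u, v
```

Points sharing a grid cell can then serve as a spatially aware cluster for local feature aggregation, as the abstract describes.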
Keywords
Cylindrical Transformer, Fine-grained correspondences, Image-to-Point cloud registration
Discipline
Graphics and Human Computer Interfaces | Numerical Analysis and Scientific Computing
Research Areas
Software and Cyber-Physical Systems
Publication
International Journal of Computer Vision
Volume
134
Issue
4
First Page
1
Last Page
21
ISSN
0920-5691
Identifier
10.1007/s11263-026-02747-w
Publisher
Springer
Citation
WANG, Jingtao; TANG, Hao; SUN, Yanpeng; HE, Shengfeng; and LI, Zechao.
CylindFormer: Image-to-Point cloud registration with cylindrical transformer. (2026). International Journal of Computer Vision. 134, (4), 1-21.
Available at: https://ink.library.smu.edu.sg/sis_research/11073
Copyright Owner and License
Authors
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1007/s11263-026-02747-w
Included in
Graphics and Human Computer Interfaces Commons, Numerical Analysis and Scientific Computing Commons