Publication Type

Journal Article

Version

acceptedVersion

Publication Date

3-2026

Abstract

Accurate correspondence extraction between distinctive pixel-wise and point-wise features is critical for image-to-point cloud (I2P) registration. Recent efforts leveraging Transformers for I2P feature representation have shown promise, primarily by first capturing intra-modality global contextual dependencies via self-attention and then learning cross-modality correlations via cross-attention. The strength of vanilla Transformers lies in modeling global cross-modality feature correlations; however, such mechanisms often struggle with the structural disparity between dense image pixels and sparse 3D points, hindering the establishment of fine-grained correspondences. Moreover, global attention may introduce ambiguity, as interactions with many inconsistent intra-modality regions can degrade feature distinctiveness. To address these limitations, we propose CylindFormer, a novel Cylindrical Transformer designed to establish accurate and reliable correspondences for efficient and robust I2P registration. Specifically, to overcome the inherent structural discrepancy, our method leverages cylindrical projection of 3D points onto the image plane to define spatially aware clusters, enabling local feature aggregation of image pixels. These fused features are then aligned with 3D point features through adaptive attention to strengthen cross-modality correlations. In addition, CylindFormer introduces a cylindrical self-attention mechanism that explicitly learns intra-modality global structural consistency, effectively mitigating feature ambiguity. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate the efficacy of CylindFormer. On the challenging RGB-D Scenes V2 dataset, our method improves the inlier ratio by 10.1∼16.3 percentage points and the registration recall by 1.3∼14.6 points, while achieving more than 13× faster pose estimation and reducing model parameters to less than one-ninth of those of the state-of-the-art method.
The source code will be released at https://github.com/jtw220/CylindFormer soon.
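The cylindrical projection underlying the spatially aware clusters can be illustrated with a minimal sketch: each 3D point is mapped to azimuth and inclination angles and then discretized into pixel-like bins, so that points landing in the same bin can be grouped with nearby image pixels. The function name, bin resolutions, and all details below are illustrative assumptions, not CylindFormer's released implementation.

```python
import numpy as np

def cylindrical_project(points, h_res=0.01, v_res=0.01):
    """Map 3D points (N, 3) to cylindrical-image bin coordinates.

    Illustrative sketch only (not CylindFormer's code): each point is
    converted to (azimuth, inclination) angles on a cylinder around the
    z-axis, then discretized into integer bins at the given angular
    resolutions (radians per bin).
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    azimuth = np.arctan2(y, x)                # angle around the cylinder axis
    r = np.sqrt(x**2 + y**2)                  # horizontal distance to the axis
    inclination = np.arctan2(z, r)            # elevation angle
    u = ((azimuth + np.pi) / h_res).astype(int)           # horizontal bin index
    v = ((inclination + np.pi / 2) / v_res).astype(int)   # vertical bin index
    return np.stack([u, v], axis=1)
```

Points that fall into the same (u, v) bin would be treated as one spatially aware cluster, giving a 2D grid on which pixel features can be locally aggregated before cross-modality alignment.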

Keywords

Cylindrical Transformer, Fine-grained correspondences, Image-to-Point cloud registration

Discipline

Graphics and Human Computer Interfaces | Numerical Analysis and Scientific Computing

Research Areas

Software and Cyber-Physical Systems

Publication

International Journal of Computer Vision

Volume

134

Issue

4

First Page

1

Last Page

21

ISSN

0920-5691

Identifier

10.1007/s11263-026-02747-w

Publisher

Springer

Copyright Owner and License

Authors

Additional URL

https://doi.org/10.1007/s11263-026-02747-w
