Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

10-2025

Abstract

3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared, Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly in complex queries requiring precise spatial differentiation. Code is available at https://github.com/visualjason/ViewSRD.

Discipline

Graphics and Human Computer Interfaces | Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

2025 International Conference on Computer Vision ICCV: Honolulu, October 19-21: Proceedings

First Page

1

Last Page

11

Publisher

IEEE

City or Country

Piscataway

Additional URL

https://openaccess.thecvf.com/content/ICCV2025/papers/Huang_ViewSRD_3D_Visual_Grounding_via_Structured_Multi-View_Decomposition_ICCV_2025_paper.pdf
