Publication Type
Journal Article
Version
publishedVersion
Publication Date
1-2026
Abstract
Vision-and-Language Navigation in continuous environments (VLN-CE) requires an embodied robot to navigate to a target destination by following natural language instructions. Most existing methods use panoramic RGB-D cameras for 360° observation of environments. However, these methods struggle in real-world applications because of the high cost of panoramic RGB-D cameras. This paper studies a low-cost and practical VLN-CE setting, i.e., using a monocular camera with a limited field of view, which means “Look Less” for visual observations and environment semantics. In this paper, we propose a ThinkMatter framework for monocular VLN-CE, in which we motivate monocular robots to “Think More” by 1) generating novel views and 2) integrating instruction semantics. Specifically, we achieve the former with the proposed 3DGS-based panoramic generation, which renders novel views at each step based on collected past observations. We achieve the latter with the proposed occupancy-instruction semantic enhancement, which integrates the spatial semantics of occupancy maps with the textual semantics of language instructions. These operations provide monocular robots with wider environmental perception as well as transparent semantic connections to the instruction. Extensive experiments in both simulators and real-world environments demonstrate the effectiveness of ThinkMatter, providing a promising practice for real-world navigation.
Keywords
vision-and-language navigation, panoramic view synthesis, semantic map learning
Discipline
Artificial Intelligence and Robotics | Databases and Information Systems
Research Areas
Data Science and Engineering; Intelligent Systems and Optimization
Publication
IEEE Transactions on Image Processing
Volume
74
First Page
875
Last Page
903
ISSN
1057-7149
Identifier
10.1109/TIP.2026.3652003
Publisher
Institute of Electrical and Electronics Engineers
Citation
DAI, Guangzhao; WANG, Shuo; ZHAO, Hao; ZHU, Bin; SUN, Qianru; and SHU, Xiangbo.
ThinkMatter: Panoramic-aware instructional semantics for monocular vision-and-language navigation. (2026). IEEE Transactions on Image Processing. 74, 875-903.
Available at: https://ink.library.smu.edu.sg/sis_research/10905
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1109/TIP.2026.3652003