Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
10-2025
Abstract
Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). However, existing approaches focus on text CoT, limiting their ability to leverage visual cues. Visual CoT remains underexplored, and the only work [35] is based on supervised fine-tuning that relies on extensive labeled bounding-box data and is hard to generalize to unseen cases. In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization. UV-CoT performs preference comparisons between model-generated bounding boxes (one is preferred and the other is dis-preferred), eliminating the need for bounding-box annotations. We get such preference data by introducing an automatic data generation pipeline. Given an image, our target MLLM (e.g., LLaVA-1.5-7B) generates seed bounding boxes using a template prompt and then answers the question using each bounded region as input. An evaluator MLLM (e.g., OmniLLM-12B) ranks the responses, and these rankings serve as supervision to train the target MLLM with UV-CoT by minimizing negative log-likelihood losses. By emulating human perception (identifying key regions and reasoning based on them), UV-CoT can improve visual comprehension, particularly in spatial reasoning tasks where textual descriptions alone fall short. Our experiments on six datasets demonstrate the superiority of UV-CoT, compared to the state-of-the-art textual and visual CoT methods. Our zero-shot testing on four unseen datasets shows the strong generalization of UV-CoT.
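The step in which evaluator rankings become preference supervision can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` record, the `build_preference_pairs` helper, the `min_gap` threshold, the box format, and the numeric scores are all hypothetical, and the evaluator MLLM is assumed to have already assigned a score to each answer produced from a seed bounding box.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    """One seed bounding box plus the answer produced from that region (hypothetical record)."""
    box: tuple          # assumed (x1, y1, x2, y2) format
    answer: str
    score: float        # assumed evaluator score, higher means better


def build_preference_pairs(candidates, min_gap=0.0):
    """Turn scored candidates into (preferred, dis-preferred) pairs.

    Pairs whose score difference is not larger than min_gap are skipped,
    because they carry no clear preference signal.
    """
    pairs = []
    for a, b in combinations(candidates, 2):
        if abs(a.score - b.score) <= min_gap:
            continue
        preferred, dispreferred = (a, b) if a.score > b.score else (b, a)
        pairs.append((preferred, dispreferred))
    return pairs


# Made-up example: three seed boxes for one image-question pair.
candidates = [
    Candidate(box=(10, 20, 110, 140), answer="a red stop sign", score=0.9),
    Candidate(box=(0, 0, 50, 50), answer="a tree", score=0.4),
    Candidate(box=(200, 30, 260, 90), answer="a car", score=0.4),
]
pairs = build_preference_pairs(candidates)
```

In this made-up example the two tied candidates produce no pair, so only the comparisons against the highest-scored box remain. Such pairs would then feed a preference-optimization loss, whose exact form the abstract does not specify.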
Discipline
Artificial Intelligence and Robotics | Educational Methods
Research Areas
Software and Cyber-Physical Systems
Publication
2025 International Conference on Computer Vision ICCV: Honolulu, October 19-23, Proceedings
First Page
1
Last Page
10
Publisher
IEEE Computer Society
City or Country
Washington, DC
Citation
ZHAO, Kesen; ZHU, Beier; SUN, Qianru; and ZHANG, Hanwang.
Unsupervised visual chain-of-thought reasoning via preference optimization. (2025). 2025 International Conference on Computer Vision ICCV: Honolulu, October 19-23, Proceedings. 1-10.
Available at: https://ink.library.smu.edu.sg/sis_research/10881
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://openaccess.thecvf.com/content/ICCV2025/papers/Zhao_Unsupervised_Visual_Chain-of-Thought_Reasoning_via_Preference_Optimization_ICCV_2025_paper.pdf