Research Collection School Of Computing and Information Systems

Unsupervised visual chain-of-thought reasoning via preference optimization

Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

10-2025

Abstract

Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). However, existing approaches focus on text CoT, limiting their ability to leverage visual cues. Visual CoT remains underexplored, and the only work [35] is based on supervised fine-tuning that relies on extensive labeled bounding-box data and is hard to generalize to unseen cases. In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization. UV-CoT performs preference comparisons between model-generated bounding boxes (one is preferred and the other is dis-preferred), eliminating the need for bounding-box annotations. We get such preference data by introducing an automatic data generation pipeline. Given an image, our target MLLM (e.g., LLaVA-1.5-7B) generates seed bounding boxes using a template prompt and then answers the question using each bounded region as input. An evaluator MLLM (e.g., OmniLLM-12B) ranks the responses, and these rankings serve as supervision to train the target MLLM with UV-CoT by minimizing negative log-likelihood losses. By emulating human perception (identifying key regions and reasoning based on them), UV-CoT can improve visual comprehension, particularly in spatial reasoning tasks where textual descriptions alone fall short. Our experiments on six datasets demonstrate the superiority of UV-CoT, compared to the state-of-the-art textual and visual CoT methods. Our zero-shot testing on four unseen datasets shows the strong generalization of UV-CoT.
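
The data generation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. All names here (Candidate, PreferencePair, build_preference_pairs, preference_nll, min_gap) are hypothetical, the evaluator scores are placeholder numbers standing in for the evaluator MLLM's rankings, and the logistic form of the loss is an assumption: the abstract states only that training minimizes negative log-likelihood losses. Model calls are omitted.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    """One seed bounding box, the answer produced from it, and its score."""
    box: tuple      # (x1, y1, x2, y2); hypothetical pixel coordinates
    answer: str     # answer the target model gave using this region
    score: float    # placeholder evaluator score, higher is better


@dataclass(frozen=True)
class PreferencePair:
    preferred: Candidate
    dispreferred: Candidate


def build_preference_pairs(candidates, min_gap=0.0):
    """Turn scored candidates into (preferred, dispreferred) pairs.

    Pairs whose score difference is at most min_gap are skipped, since a
    tie carries no preference signal. min_gap is an assumed knob.
    """
    pairs = []
    for a, b in combinations(candidates, 2):
        if abs(a.score - b.score) <= min_gap:
            continue
        preferred, dispreferred = (a, b) if a.score > b.score else (b, a)
        pairs.append(PreferencePair(preferred, dispreferred))
    return pairs


def preference_nll(margin):
    """Numerically stable -log(sigmoid(margin)).

    margin is assumed to be how strongly the model being trained favors
    the preferred response over the dispreferred one; the loss shrinks as
    the margin grows.
    """
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    seeds = [
        Candidate((0, 0, 10, 10), "a red car", 0.9),
        Candidate((5, 5, 20, 20), "a truck", 0.4),
        Candidate((1, 1, 2, 2), "unclear", 0.4),
    ]
    for pair in build_preference_pairs(seeds):
        print(pair.preferred.box, "preferred over", pair.dispreferred.box)
```

In this sketch the two candidates with equal scores form no pair, so only comparisons with a clear winner would contribute to training.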

Discipline

Artificial Intelligence and Robotics | Educational Methods

Research Areas

Software and Cyber-Physical Systems

Publication

2025 International Conference on Computer Vision ICCV: Honolulu, October 19-23, Proceedings

First Page

Last Page

Publisher

IEEE Computer Society

City or Country

Washington, DC

Citation

ZHAO, Kesen; ZHU, Beier; SUN, Qianru; and ZHANG, Hanwang. Unsupervised visual chain-of-thought reasoning via preference optimization. (2025). 2025 International Conference on Computer Vision ICCV: Honolulu, October 19-23, Proceedings. 1-10.
Available at: https://ink.library.smu.edu.sg/sis_research/10881

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://openaccess.thecvf.com/content/ICCV2025/papers/Zhao_Unsupervised_Visual_Chain-of-Thought_Reasoning_via_Preference_Optimization_ICCV_2025_paper.pdf

Included in

Artificial Intelligence and Robotics Commons, Educational Methods Commons
