Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

10-2025

Abstract

Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). However, existing approaches focus on text CoT, limiting their ability to leverage visual cues. Visual CoT remains underexplored, and the only work [35] is based on supervised fine-tuning that relies on extensive labeled bounding-box data and is hard to generalize to unseen cases. In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization. UV-CoT performs preference comparisons between model-generated bounding boxes (one is preferred and the other is dis-preferred), eliminating the need for bounding-box annotations. We get such preference data by introducing an automatic data generation pipeline. Given an image, our target MLLM (e.g., LLaVA-1.5-7B) generates seed bounding boxes using a template prompt and then answers the question using each bounded region as input. An evaluator MLLM (e.g., OmniLLM-12B) ranks the responses, and these rankings serve as supervision to train the target MLLM with UV-CoT by minimizing negative log-likelihood losses. By emulating human perception (identifying key regions and reasoning based on them), UV-CoT can improve visual comprehension, particularly in spatial reasoning tasks where textual descriptions alone fall short. Our experiments on six datasets demonstrate the superiority of UV-CoT, compared to the state-of-the-art textual and visual CoT methods. Our zero-shot testing on four unseen datasets shows the strong generalization of UV-CoT.
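
As a rough illustration of the data generation step described in the abstract, the sketch below turns evaluator scores for candidate regions into (preferred, dis-preferred) pairs. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Candidate` type, the `build_preference_pairs` function, the numeric `score` field, and the `min_gap` threshold are all hypothetical names introduced here, and the abstract does not say how rankings are converted into pairs.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    """One seed bounding box with the answer produced from that region.

    All fields are illustrative assumptions; `score` stands in for
    whatever ranking signal the evaluator MLLM returns.
    """
    box: tuple      # (x1, y1, x2, y2) in pixel coordinates
    answer: str     # answer generated from the cropped region
    score: float    # hypothetical evaluator score, higher is better


def build_preference_pairs(candidates, min_gap=0.0):
    """Return (preferred, dispreferred) pairs from scored candidates.

    Pairs whose score difference is at most `min_gap` are skipped,
    since they carry no clear preference signal.
    """
    pairs = []
    for a, b in combinations(candidates, 2):
        if abs(a.score - b.score) <= min_gap:
            continue
        preferred, dispreferred = (a, b) if a.score > b.score else (b, a)
        pairs.append((preferred, dispreferred))
    return pairs


# Hypothetical example: three candidate regions for one image/question.
candidates = [
    Candidate(box=(10, 20, 110, 140), answer="a red mug", score=0.9),
    Candidate(box=(0, 0, 50, 50), answer="a table", score=0.4),
    Candidate(box=(200, 40, 260, 90), answer="a chair", score=0.4),
]
pairs = build_preference_pairs(candidates)
```

In this example the two tied candidates do not form a pair, so only the pairs that rank the highest-scoring region above each of the others would be passed on to preference training.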

Discipline

Artificial Intelligence and Robotics | Educational Methods

Research Areas

Software and Cyber-Physical Systems

Publication

2025 International Conference on Computer Vision ICCV: Honolulu, October 19-23, Proceedings

First Page

1

Last Page

10

Publisher

IEEE Computer Society

City or Country

Washington, DC

Additional URL

https://openaccess.thecvf.com/content/ICCV2025/papers/Zhao_Unsupervised_Visual_Chain-of-Thought_Reasoning_via_Preference_Optimization_ICCV_2025_paper.pdf
