Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

10-2025

Abstract

Recently, Multimodal Large Language Models (MLLMs) have achieved significant success across multiple disciplines owing to their exceptional instruction-following capabilities and extensive world knowledge. However, whether these MLLMs possess human-like compositional reasoning abilities remains an open question. In this paper, we first curate the Multimodal Assumptive Reasoning Benchmark (MARS-Bench) to unveil their reasoning behaviors. Interestingly, we find that most prevalent MLLMs are easily fooled by the introduction of a presupposition into the question, even though such presuppositions appear naive to human reasoners. We further propose Active Deduction (AD), a simple yet effective reinforcement learning paradigm that encourages the model to actively perform composite deduction before reaching a final decision. Equipped with the proposed AD method, an MLLM demonstrates significant improvements in assumptive reasoning without compromising its general-purpose question-answering performance. We also provide extensive evaluations of both open-source and proprietary MLLMs on MARS-Bench, along with experimental analyses of the AD method.

Keywords

Assumptive reasoning, MLLMs, VQA, Benchmark, GRPO

Discipline

Graphics and Human Computer Interfaces

Research Areas

Intelligent Systems and Optimization

Areas of Excellence

Digital transformation

Publication

MM '25: The 33rd ACM International Conference on Multimedia, Dublin, Ireland, October 27-31, 2025

First Page

2713

Last Page

2722

ISBN

979-8-4007-2035-2

Identifier

10.1145/3746027.3754720

Publisher

ACM

City or Country

New York, NY, USA

Additional URL

https://doi.org/10.1145/3746027.3754720
