Research Collection School Of Computing and Information Systems

Seeing culture: A benchmark for visual reasoning and grounding

Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

11-2025

Abstract

Multimodal vision-language models (VLMs) have made substantial progress in various tasks that require a combined understanding of visual and textual content, particularly in cultural understanding tasks, with the emergence of new cultural datasets. However, these datasets frequently fall short of providing cultural reasoning while underrepresenting many cultures. In this paper, we introduce the Seeing Culture Benchmark (SCB), focusing on cultural reasoning with a novel approach that requires VLMs to reason on culturally rich images in two stages: i) selecting the correct visual option with multiple-choice visual question answering (VQA), and ii) segmenting the relevant cultural artifact as evidence of reasoning. Visual options in the first stage are systematically organized into three types: those originating from the same country, those from different countries, or a mixed group. Notably, all options are derived from a single category for each type. Progression to the second stage occurs only after a correct visual option is chosen. The SCB comprises 1,065 images that capture 138 cultural artifacts across five categories from seven Southeast Asian countries, whose diverse cultures are often overlooked, accompanied by 3,178 questions, of which 1,093 are unique and meticulously curated by human annotators. Our evaluation of various VLMs reveals the complexities involved in cross-modal cultural reasoning and highlights the disparity between visual reasoning and spatial grounding in culturally nuanced scenarios. The SCB serves as a crucial benchmark for identifying these shortcomings, thereby guiding future developments in the field of cultural reasoning. https://github.com/buraksatar/SeeingCulture
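
The gated two-stage protocol described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the `Example` record, the `iou` and `evaluate` helpers, the pixel-set mask representation, and the choice of intersection-over-union as the grounding score are all assumptions made for illustration, since the abstract does not name a segmentation metric.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

Pixel = Tuple[int, int]  # (row, col); hypothetical mask representation


@dataclass
class Example:
    """One benchmark item (hypothetical structure, not the SCB schema)."""
    correct_option: int                       # index of the right visual option
    predicted_option: int                     # option chosen by the model (stage 1)
    gold_mask: Set[Pixel]                     # annotated artifact pixels
    predicted_mask: Optional[Set[Pixel]] = None  # model segmentation (stage 2)


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection-over-union of two pixel sets (assumed grounding score)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def evaluate(examples: List[Example]) -> Tuple[float, float]:
    """Return (stage-1 accuracy, mean IoU over items that passed stage 1)."""
    total = len(examples)
    # Stage 1: multiple-choice VQA over the visual options.
    passed = [ex for ex in examples if ex.predicted_option == ex.correct_option]
    accuracy = len(passed) / total if total else 0.0
    # Stage 2: segmentation is scored only when stage 1 was answered correctly.
    scores = [iou(ex.predicted_mask or set(), ex.gold_mask) for ex in passed]
    mean_iou = sum(scores) / len(scores) if scores else 0.0
    return accuracy, mean_iou
```

Under these assumptions, an item answered incorrectly in the first stage contributes to accuracy but is never scored for segmentation, which mirrors the statement that progression to the second stage occurs only after a correct visual option is chosen.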

Discipline

Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces

Research Areas

Intelligent Systems and Optimization

Areas of Excellence

Digital transformation

Publication

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, November 4-9

First Page

22238

Last Page

22254

Identifier

10.18653/v1/2025.emnlp-main.1131

Publisher

ACL

City or Country

Suzhou

Citation

SATAR, Burak; MA, Zhixin; IRRAWAN, Patrick Amadeus; MULYAWAN, Wilfried Ariel; JIANG, Jing; LIM, Ee-Peng; and NGO, Chong-wah. Seeing culture: A benchmark for visual reasoning and grounding. (2025). Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, November 4-9. 22238-22254.
Available at: https://ink.library.smu.edu.sg/sis_research/10715

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.18653/v1/2025.emnlp-main.1131

Included in

Artificial Intelligence and Robotics Commons, Graphics and Human Computer Interfaces Commons
