Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

1-2026

Abstract

Despite the rapid advances in Visual Language Models (VLMs), these models struggle to recognize culture-specific food items. While VLMs are effective in recognizing popular cultural dishes, their performance is suboptimal for dishes that are unique but not widely known internationally. Specifically, VLMs often generate either generic labels or hallucinated names for dishes that are localized to a particular culture. As a result, retrieval-augmented generation (RAG), which retrieves relevant recipes as references for VLMs, emerges as a promising approach. Nevertheless, recipe retrieval, which is itself imperfect, could mislead VLMs into generating inaccurate or culturally inappropriate dish names. This paper presents a comparative study evaluating RAG-based food recognition against conventional approaches, including neural network-based recognition and standalone recipe retrieval. We propose an optimized hybrid framework that integrates the strengths of both VLMs and conventional techniques. The proposed framework achieves the best overall performance in recognizing multicultural dishes and demonstrates robustness in identifying out-of-distribution dishes from a different domain.

Keywords

food recognition, RAG, search re-ranking, VLMs

Discipline

Artificial Intelligence and Robotics | Food Science

Research Areas

Software and Cyber-Physical Systems

Publication

Multimedia Modeling: 32nd International Conference on Multimedia Modeling, MMM 2026, Prague, Czech Republic, January 29-31, Proceedings

First Page

1

Last Page

15

ISBN

9789819569595

Identifier

10.1007/978-981-95-6960-1_29

Publisher

Springer

City or Country

Cham
