Research Collection School Of Computing and Information Systems

Food recognition with visual language models: Search re-ranking or retrieval-augmented generation?

Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

1-2026

Abstract

Despite the rapid advances in Visual Language Models (VLMs), these models struggle to recognize culture-specific food items. While VLMs are effective in recognizing popular cultural dishes, their performance is suboptimal for dishes that are unique but not widely known internationally. Specifically, VLMs often generate either generic labels or hallucinated names for dishes that are localized to a particular culture. As a result, retrieval-augmented generation (RAG), which retrieves relevant recipes as references for VLMs, emerges as a promising approach. Nevertheless, recipe retrieval, which is itself imperfect, could mislead VLMs into generating inaccurate or culturally inappropriate dish names. This paper presents a comparative study evaluating RAG-based food recognition against conventional approaches, including neural network-based recognition and standalone recipe retrieval. We propose an optimized hybrid framework that integrates the strengths of both VLMs and conventional techniques. The proposed framework achieves the best overall performance in recognizing multicultural dishes and demonstrates robustness in identifying out-of-distribution dishes from a different domain.

Keywords

food recognition, RAG, search re-ranking, VLMs

Discipline

Artificial Intelligence and Robotics | Food Science

Research Areas

Software and Cyber-Physical Systems

Publication

Multimedia Modeling: 32nd International Conference on Multimedia Modeling, MMM 2026, Prague, Czech Republic, January 29-31, Proceedings

First Page

1

Last Page

15

ISBN

9789819569595

Identifier

10.1007/978-981-95-6960-1_29

Publisher

Springer

City or Country

Cham

Citation

GAN, Kian Yu; NGUYEN, Phuong Anh; and NGO, Chong-wah. Food recognition with visual language models: Search re-ranking or retrieval-augmented generation? (2026). Multimedia Modeling: 32nd International Conference on Multimedia Modeling, MMM 2026, Prague, Czech Republic, January 29-31, Proceedings. 1-15.
Available at: https://ink.library.smu.edu.sg/sis_research/11036

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Included in

Artificial Intelligence and Robotics Commons, Food Science Commons
