Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
7-2025
Abstract
Fine-grained ingredient recognition is challenging because ingredients vary widely in appearance depending on how they are cut and cooked. While existing approaches have shown promising results, they still incur substantial training costs and focus solely on fine-grained ingredient recognition. In this paper, we address these limitations by introducing an efficient prompt-tuning framework that adapts pretrained vision-language models (VLMs), such as CLIP, to the ingredient recognition task without requiring full model finetuning. Additionally, we introduce three-level ingredient hierarchies to enhance both training performance and evaluation robustness. Specifically, we propose a hierarchical ingredient recognition task, designed to evaluate model performance across different hierarchical levels (e.g., chicken chunks, chicken, meat), capturing recognition capabilities from coarse- to fine-grained categories. Our method leverages hierarchical labels, training prompt-tuned models with both fine-grained and corresponding coarse-grained labels. Experimental results on the VireoFood172 dataset demonstrate the effectiveness of prompt-tuning with hierarchical labels. Moreover, the hierarchical ingredient recognition task provides valuable insights into the model’s ability to generalize across different levels of ingredient granularity.
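The hierarchical evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single predicted label per sample (a simplification), and the names `HIERARCHY`, `lift`, and `hierarchical_accuracy` are hypothetical. Only the chain "chicken chunks, chicken, meat" comes from the abstract; the other labels are made up for illustration.

```python
# Hypothetical three-level hierarchy: fine-grained label -> (mid-level, coarse-level).
# Only "chicken chunks" -> "chicken" -> "meat" appears in the abstract; the rest is illustrative.
HIERARCHY = {
    "chicken chunks": ("chicken", "meat"),
    "shredded chicken": ("chicken", "meat"),
    "beef slices": ("beef", "meat"),
    "diced carrot": ("carrot", "vegetable"),
}


def lift(label, level):
    """Map a fine-grained label to the requested level (0=fine, 1=mid, 2=coarse)."""
    if level == 0:
        return label
    return HIERARCHY[label][level - 1]


def hierarchical_accuracy(predictions, targets):
    """Return accuracy at each of the three levels for single-label predictions."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and of equal length")
    names = ("fine", "mid", "coarse")
    return {
        name: sum(lift(p, lvl) == lift(t, lvl) for p, t in zip(predictions, targets))
        / len(targets)
        for lvl, name in enumerate(names)
    }
```

Under this sketch, a prediction of "shredded chicken" for a "chicken chunks" sample counts as wrong at the fine level but correct at the mid and coarse levels, which is the kind of distinction a per-level evaluation is meant to surface.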
Keywords
Hierarchical ingredient recognition, prompt tuning, vision-language model
Discipline
Graphics and Human Computer Interfaces
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
Proceedings of the 2025 IEEE International Conference on Multimedia and Expo (ICME 2025), Nantes, France, June 30 - July 4
First Page
1
Last Page
6
Identifier
10.48550/arXiv.2504.10322
Publisher
IEEE
City or Country
Piscataway, NJ
Citation
GUI, Yinxuan; ZHU, Bin; CHEN, Jingjing; and NGO, Chong-wah.
Efficient prompt tuning for hierarchical ingredient recognition. (2025). Proceedings of the 2025 IEEE International Conference on Multimedia and Expo (ICME 2025), Nantes, France, June 30 - July 4. 1-6.
Available at: https://ink.library.smu.edu.sg/sis_research/10365
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.48550/arXiv.2504.10322