Publication Type
Journal Article
Version
publishedVersion
Publication Date
7-2025
Abstract
Large Multi-modal Models (LMMs) have made impressive progress in many vision-language tasks. Nevertheless, the performance of general LMMs in specific domains is still far from satisfactory. This paper proposes FoodLMM, a versatile food assistant based on LMMs with various capabilities, including food recognition, ingredient recognition, recipe generation, nutrition estimation, food segmentation and multi-round conversation. To enable FoodLMM to handle tasks beyond pure text output, we introduce a series of novel task-specific tokens and heads, enabling the model to predict food nutritional values and multiple segmentation masks. We adopt a two-stage training strategy. In the first stage, we utilize multiple public food benchmarks for multi-task learning, leveraging the instruction-following paradigm. In the second stage, we construct a multi-round conversation dataset and a reasoning segmentation dataset to fine-tune the model, enabling it to conduct professional dialogues and generate segmentation masks based on complex reasoning in the food domain. Our fine-tuned FoodLMM achieves state-of-the-art results across several food benchmarks. We will make our code, models and datasets publicly available.
Keywords
Food Assistant, Large Multi-modal Models, Multi-task Learning
Discipline
Artificial Intelligence and Robotics
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
IEEE Transactions on Multimedia
First Page
1
Last Page
38
ISSN
1520-9210
Identifier
10.48550/arXiv.2312.14991
Publisher
Institute of Electrical and Electronics Engineers
Citation
YIN, Yuehao; QI, Huiyan; ZHU, Bin; CHEN, Jingjing; JIANG, Yu-Gang; and NGO, Chong-wah.
FoodLMM: A versatile food assistant using large multi-modal model. (2025). IEEE Transactions on Multimedia. 1-38.
Available at: https://ink.library.smu.edu.sg/sis_research/10385
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.48550/arXiv.2312.14991