Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
10-2024
Abstract
Learning recipe and food image representations in a common embedding space is non-trivial but crucial for cross-modal recipe retrieval. In this paper, we propose a new perspective on this problem by utilizing foundation models for data augmentation. Leveraging the remarkable capabilities of foundation models (i.e., Llama2 and SAM), we propose to augment recipes and food images by extracting alignable information related to the counterpart modality. Specifically, Llama2 is employed to generate a textual description from the recipe, aiming to capture the visual cues of a food image, and SAM is used to produce image segments that correspond to key ingredients in the recipe. To make full use of the augmented data, we introduce the Data Augmented Retrieval framework (DAR) to enhance recipe and image representation learning for cross-modal retrieval. We first inject adapter layers into a pre-trained CLIP model to reduce computation cost, rather than fully fine-tuning all the parameters. In addition, a multi-level circle loss is proposed to align the original and augmented data pairs, which assigns different penalties to positive and negative pairs. On the Recipe1M dataset, our DAR outperforms all existing methods by a large margin. Extensive ablation studies validate the effectiveness of each component of DAR. Code is available at https://github.com/Noah888/DAR.
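To illustrate what "different penalties for positive and negative pairs" can mean, here is a minimal sketch of the standard pairwise circle loss (Sun et al., 2020) on which such an objective is commonly built. It is not the multi-level variant used in DAR, whose details are not given in this record. The function names, the `margin` and `gamma` defaults, and the example similarity values are assumptions chosen for illustration.

```python
import numpy as np


def _logsumexp(x):
    # Numerically stable log(sum(exp(x))) for a 1-D array.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))


def circle_loss(s_pos, s_neg, margin=0.25, gamma=64.0):
    """Standard circle loss over cosine similarities (illustrative sketch).

    s_pos: similarities of matching (positive) pairs.
    s_neg: similarities of non-matching (negative) pairs.
    margin, gamma: assumed hyperparameters, not taken from the DAR paper.
    """
    s_pos = np.asarray(s_pos, dtype=np.float64)
    s_neg = np.asarray(s_neg, dtype=np.float64)

    # Per-pair weights: pairs far from their optimum get a larger penalty.
    alpha_p = np.clip(1.0 + margin - s_pos, 0.0, None)  # optimum O_p = 1 + m
    alpha_n = np.clip(s_neg + margin, 0.0, None)        # optimum O_n = -m

    # Weighted logits relative to the margins (1 - m) and m.
    logit_p = -gamma * alpha_p * (s_pos - (1.0 - margin))
    logit_n = gamma * alpha_n * (s_neg - margin)

    # log(1 + sum_j exp(logit_n_j) * sum_i exp(logit_p_i))
    z = _logsumexp(logit_p) + _logsumexp(logit_n)
    return float(np.logaddexp(0.0, z))


# Hypothetical similarities: a well-separated case and a poorly separated one.
well_separated = circle_loss([0.9], [0.1])
poorly_separated = circle_loss([0.3], [0.7])
print(well_separated < poorly_separated)
```

Because each pair carries its own weight, a positive pair with low similarity or a negative pair with high similarity contributes more to the loss than a pair that is already close to its target.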
Keywords
Recipe retrieval, Data augmentation, Foundation models
Discipline
Databases and Information Systems | Graphics and Human Computer Interfaces
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
Proceedings of the 18th European Conference, Milan, Italy, 2024 September 29-October 4
First Page
111
Last Page
127
ISBN
9783031729829
Identifier
10.1007/978-3-031-72983-6_7
Publisher
Springer
City or Country
Cham
Citation
SONG, Fangzhou; ZHU, Bin; HAO, Yanbin; and WANG, Shuo.
Enhancing recipe retrieval with foundation models: A data augmentation perspective. (2024). Proceedings of the 18th European Conference, Milan, Italy, 2024 September 29-October 4. 111-127.
Available at: https://ink.library.smu.edu.sg/sis_research/9726
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1007/978-3-031-72983-6_7
Included in
Databases and Information Systems Commons, Graphics and Human Computer Interfaces Commons