Publication Type
Journal Article
Version
acceptedVersion
Publication Date
10-2025
Abstract
Recent text-to-image generation models excel at creating diverse and realistic images. This success extends to food imagery, where various conditional inputs such as cooking styles, ingredients, and recipes are used. However, generating a sequence of procedural images from the cooking steps of a recipe remains an unexplored challenge. Such a capability could enhance the cooking experience with visual guidance and possibly lead to an intelligent cooking simulation system. To fill this gap, we introduce a novel task called cooking procedural image generation. The task is inherently demanding, as it requires photo-realistic images that align with the cooking steps while preserving sequential consistency. To tackle these challenges jointly, we present CookingDiffusion, a novel approach that leverages Stable Diffusion and three innovative Memory Nets to model procedural prompts. These prompts encompass text prompts (representing cooking steps), image prompts (corresponding to cooking images), and multi-modal prompts (mixing cooking steps and images), ensuring the consistent generation of cooking procedural images. To validate the effectiveness of our approach, we preprocess the YouCookII dataset, establishing a new benchmark. Our experimental results demonstrate that our model generates high-quality cooking procedural images with strong consistency across sequential cooking steps, as measured by both FID and the proposed Average Procedure Consistency metrics. Furthermore, CookingDiffusion demonstrates the ability to manipulate ingredients and cooking methods in a recipe. We will make our code, models, and dataset publicly accessible.
Keywords
Cooking Procedural Image Generation, Procedural Prompts, CookingDiffusion, Memory Net
Discipline
Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
ACM Transactions on Multimedia Computing, Communications and Applications
First Page
1
Last Page
23
ISSN
1551-6857
Identifier
10.1145/3771995
Publisher
Association for Computing Machinery (ACM)
Citation
WANG, Yuan; ZHU, Bin; HAO, Yanbin; NGO, Chong-wah; TAN, Yi; and WANG, Xiang.
CookingDiffusion: Cooking procedural image generation with Stable Diffusion. (2025). ACM Transactions on Multimedia Computing, Communications and Applications. 1-23.
Available at: https://ink.library.smu.edu.sg/sis_research/10468
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1145/3771995
Included in
Artificial Intelligence and Robotics Commons, Graphics and Human Computer Interfaces Commons