Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
10-2023
Abstract
We observe that diffusion probabilistic models (DPMs), despite their success in image synthesis, often lack the contextual reasoning ability needed to learn the relations among object parts in an image, which leads to a slow learning process. To solve this issue, we propose the Masked Diffusion Transformer (MDT), which introduces a mask latent modeling scheme to explicitly enhance the ability of DPMs to learn contextual relations among the semantic parts of objects in an image. During training, MDT operates in the latent space and masks certain tokens. An asymmetric masking diffusion transformer is then designed to predict the masked tokens from the unmasked ones while maintaining the diffusion generation process. MDT can thus reconstruct the full information of an image from incomplete contextual input, which enables it to learn the associated relations among image tokens. Experimental results show that MDT achieves superior image synthesis performance, e.g., a new state-of-the-art (SOTA) FID score on the ImageNet dataset, and learns about 3× faster than the previous SOTA, DiT. The source code is released at https://github.com/sail-sg/MDT.
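The token-masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); the function names, the mask ratio of 0.25, the uniform random sampling, and the use of None as a placeholder are all assumptions made for illustration. It only shows how a flattened sequence of latent tokens might be split into visible tokens and masked positions, and how a full-length sequence can be rebuilt with placeholders that a predictor would then fill in.

```python
import random


def mask_latent_tokens(tokens, mask_ratio, seed=None):
    """Split a token sequence into visible tokens and masked positions.

    Hypothetical helper: the name, signature, and uniform random
    sampling are illustrative assumptions, not taken from the paper.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    n = len(tokens)
    num_masked = int(n * mask_ratio)
    # Choose which positions to hide, without replacement.
    masked_idx = sorted(rng.sample(range(n), num_masked))
    masked_set = set(masked_idx)
    visible_idx = [i for i in range(n) if i not in masked_set]
    visible_tokens = [tokens[i] for i in visible_idx]
    return visible_tokens, visible_idx, masked_idx


def restore_sequence(visible_tokens, visible_idx, masked_idx, placeholder=None):
    """Rebuild a full-length sequence with a placeholder at masked positions.

    In a real model the placeholder positions would be predicted from
    the visible tokens; here they are simply left as `placeholder`.
    """
    n = len(visible_idx) + len(masked_idx)
    full = [placeholder] * n
    for tok, i in zip(visible_tokens, visible_idx):
        full[i] = tok
    return full


# 16 latent tokens (for example, a flattened 4x4 latent grid),
# each represented here as a small list of floats.
tokens = [[float(i), float(i) + 0.5] for i in range(16)]
visible, vis_idx, msk_idx = mask_latent_tokens(tokens, mask_ratio=0.25, seed=0)
full = restore_sequence(visible, vis_idx, msk_idx)
```

With 16 tokens and a ratio of 0.25, four positions are hidden and twelve remain visible; only the visible tokens would be passed on as context, which is the "incomplete contextual input" the abstract refers to.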
Keywords
Training, Representation learning, Image synthesis, Computational modeling, Synthesizers, Source coding, Semantics
Discipline
Graphics and Human Computer Interfaces
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, October 1-6
First Page
23164
Last Page
23173
ISBN
9798350307191
Identifier
10.1109/ICCV51070.2023.02117
Publisher
IEEE
City or Country
Piscataway, NJ
Citation
GAO, Shanghua; ZHOU, Pan; CHENG, Ming-Ming; and YAN, Shuicheng.
Masked diffusion transformer is a strong image synthesizer. (2023). Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, October 1-6. 23164-23173.
Available at: https://ink.library.smu.edu.sg/sis_research/9024
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1109/ICCV51070.2023.02117