Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

10-2023

Abstract

Despite their success in image synthesis, we observe that diffusion probabilistic models (DPMs) often lack the contextual reasoning ability to learn the relations among object parts in an image, leading to a slow learning process. To solve this issue, we propose a Masked Diffusion Transformer (MDT) that introduces a mask latent modeling scheme to explicitly enhance the DPMs' ability to learn contextual relations among the semantic parts of objects in an image. During training, MDT operates in the latent space to mask certain tokens. Then, an asymmetric masking diffusion transformer is designed to predict masked tokens from unmasked ones while maintaining the diffusion generation process. Our MDT can reconstruct the full information of an image from its incomplete contextual input, thus enabling it to learn the associated relations among image tokens. Experimental results show that MDT achieves superior image synthesis performance, e.g., a new SOTA FID score on the ImageNet dataset, and has about 3× faster learning speed than the previous SOTA DiT. The source code is released at https://github.com/sail-sg/MDT.
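
The token-masking step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (that is in the repository linked above); it only shows one plausible way to hide a random subset of latent tokens so that a model sees an incomplete context. The function names, the 4x4 token grid, the two-value token features, and the 0.25 mask ratio are all assumptions made for illustration and are not taken from the paper.

```python
import random


def random_token_mask(num_tokens, mask_ratio, seed=None):
    """Split token indices into (kept, masked) lists, chosen uniformly at random.

    Hypothetical helper: the uniform sampling strategy is an assumption.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_masked = int(num_tokens * mask_ratio)
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    kept = sorted(order[num_masked:])
    return kept, masked


def apply_mask(tokens, kept):
    """Return only the unmasked tokens, i.e. the incomplete context."""
    return [tokens[i] for i in kept]


# Hypothetical input: 16 latent tokens (a 4x4 grid), each a tiny feature vector.
tokens = [[float(i), float(i) * 0.5] for i in range(16)]
kept, masked = random_token_mask(len(tokens), mask_ratio=0.25, seed=0)
visible = apply_mask(tokens, kept)
print(len(visible), "visible tokens;", len(masked), "masked tokens")
# prints "12 visible tokens; 4 masked tokens"
```

In a setup like the one the abstract describes, a model would receive only the visible tokens and be trained to predict the masked ones. How that prediction is combined with the diffusion objective is not specified in this record and is left out of the sketch.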

Keywords

Training, Representation learning, Image synthesis, Computational modeling, Synthesizers, Source coding, Semantics

Discipline

Graphics and Human Computer Interfaces

Research Areas

Intelligent Systems and Optimization

Areas of Excellence

Digital transformation

Publication

Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, October 1-6

First Page

23164

Last Page

23173

ISBN

9798350307191

Identifier

10.1109/ICCV51070.2023.02117

Publisher

IEEE

City or Country

Piscataway, NJ

Citation

GAO, Shanghua; ZHOU, Pan; CHENG, Ming-Ming; and YAN, Shuicheng. Masked diffusion transformer is a strong image synthesizer. (2023). Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, October 1-6. 23164-23173.
Available at: https://ink.library.smu.edu.sg/sis_research/9024

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1109/ICCV51070.2023.02117

Included in

Graphics and Human Computer Interfaces Commons

Research Collection School Of Computing and Information Systems
