Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

6-2023

Abstract

Typical layout-to-image synthesis (LIS) models generate images for a closed set of semantic classes, e.g., 182 common objects in COCO-Stuff. In this work, we explore the freestyle capability of the model, i.e., how far it can generate unseen semantics (e.g., classes, attributes, and styles) onto a given layout, and call the task Freestyle LIS (FLIS). Thanks to the development of large-scale pre-trained language-image models, a number of discriminative models (e.g., image classification and object detection) trained on limited base classes are empowered with the ability of unseen class prediction. Inspired by this, we opt to leverage large-scale pre-trained text-to-image diffusion models to achieve the generation of unseen semantics. The key challenge of FLIS is how to enable the diffusion model to synthesize images from a specific layout which very likely violates its pre-learned knowledge, e.g., the model never sees "a unicorn sitting on a bench" during its pre-training. To this end, we introduce a new module called Rectified Cross-Attention (RCA) that can be conveniently plugged into the diffusion model to integrate semantic masks. This "plug-in" is applied in each cross-attention layer of the model to rectify the attention maps between image and text tokens. The key idea of RCA is to constrain each text token to act on the pixels in a specified region, allowing us to freely put a wide variety of semantics from pre-trained knowledge (which is general) onto the given layout (which is specific).
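
The key idea stated in the abstract, that each text token acts only on the pixels of its assigned region, can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that the rectification is done by masking the attention logits before the softmax, which the abstract does not specify. The function name `rectified_cross_attention`, its arguments, and the zero fallback for pixels with no assigned token are all hypothetical choices made for illustration.

```python
import numpy as np


def rectified_cross_attention(q, k, v, token_masks):
    """Mask-restricted cross-attention (illustrative sketch only).

    q:           (P, d) queries, one per image pixel / latent position
    k, v:        (T, d) keys and values, one per text token
    token_masks: (T, P) boolean; True where token t may act on pixel p
                 (assumed to come from the semantic layout)
    Returns the attended output (P, d) and the attention weights (P, T).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                # (P, T) scaled dot products
    allowed = token_masks.T                      # (P, T) pixel-major view

    # Assumption: disallowed pixel/token pairs are removed before the softmax.
    logits = np.where(allowed, logits, -np.inf)

    # Pixels covered by no token would give an all -inf row; handle them
    # separately so the softmax never produces NaN.
    has_any = allowed.any(axis=1, keepdims=True)
    safe = np.where(has_any, logits, 0.0)
    safe = safe - safe.max(axis=1, keepdims=True)
    weights = np.where(has_any, np.exp(safe), 0.0)

    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom,
                        out=np.zeros_like(weights), where=denom > 0)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(4, 8))   # 4 pixels
    k = rng.normal(size=(2, 8))   # 2 text tokens
    v = rng.normal(size=(2, 8))
    # Token 0 owns pixels 0-1, token 1 owns pixels 2-3.
    masks = np.array([[True, True, False, False],
                      [False, False, True, True]])
    out, w = rectified_cross_attention(q, k, v, masks)
    print(w)  # each row puts all of its weight on the token that owns the pixel
```

In this toy layout every pixel belongs to exactly one token, so each pixel's output equals that token's value vector. With overlapping regions or several tokens per region, the softmax would distribute weight among the allowed tokens only.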

Discipline

Databases and Information Systems | Graphics and Human Computer Interfaces

Research Areas

Data Science and Engineering

Publication

Proceedings of the 2023 Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023 June 18-22

First Page

14256

Last Page

14266

Publisher

CVPR

City or Country

Vancouver

Citation

XUE, Han; HUANG, Zhiwu; SUN, Qianru; SONG, Li; and ZHANG, Wenjun. Freestyle layout-to-image synthesis. (2023). Proceedings of the 2023 Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023 June 18-22. 14256-14266.
Available at: https://ink.library.smu.edu.sg/sis_research/8057

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Included in

Databases and Information Systems Commons, Graphics and Human Computer Interfaces Commons

Research Collection School Of Computing and Information Systems

Freestyle layout-to-image synthesis
