Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

12-2024

Abstract

Despite their success, unsupervised domain adaptation methods for semantic segmentation primarily focus on adaptation between image domains and do not utilize other abundant visual modalities such as depth, infrared, and event. This limitation hinders their performance and restricts their application in real-world multimodal scenarios. To address this issue, we propose Modality Adaptation with text-to-image Diffusion Models (MADM) for the semantic segmentation task, which utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model’s cross-modality capabilities. Specifically, MADM comprises two complementary components that tackle the major challenges. First, due to the large modality gap, using data from one modality to generate pseudo-labels for another modality suffers from a significant drop in accuracy. To address this, MADM designs diffusion-based pseudo-label generation, which adds latent noise to stabilize pseudo-labels and enhance label accuracy. Second, to overcome the limitations of low-resolution latent features in diffusion models, MADM introduces the label palette and latent regression, which convert one-hot encoded labels into RGB form via a palette and regress them in the latent space, so that the pre-trained decoder can be used for up-sampling to obtain fine-grained features. Extensive experimental results demonstrate that MADM achieves state-of-the-art adaptation performance across various modality tasks, including image to depth, infrared, and event modalities.
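
The label palette idea mentioned in the abstract (turning class labels into RGB form so they can pass through an image encoder and decoder) can be illustrated with a minimal sketch. This is not the authors' implementation: the three-colour palette, the function names, and the nearest-colour decoding rule are all assumptions made for illustration, and the latent regression step itself is not shown.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]

# Hypothetical 3-class palette; the colours are illustrative assumptions only.
PALETTE: List[RGB] = [(128, 64, 128), (220, 20, 60), (0, 0, 142)]


def labels_to_rgb(label_map: Sequence[Sequence[int]],
                  palette: Sequence[RGB] = PALETTE) -> List[List[RGB]]:
    """Replace every class index in a 2-D label map with its palette colour."""
    return [[palette[c] for c in row] for row in label_map]


def rgb_to_labels(rgb_map: Sequence[Sequence[RGB]],
                  palette: Sequence[RGB] = PALETTE) -> List[List[int]]:
    """Recover class indices by picking the nearest palette colour.

    Nearest means smallest squared Euclidean distance in RGB space, which
    tolerates small errors in a regressed or decoded colour image.
    """
    def nearest(pixel: RGB) -> int:
        return min(
            range(len(palette)),
            key=lambda k: sum((p - q) ** 2 for p, q in zip(pixel, palette[k])),
        )

    return [[nearest(px) for px in row] for row in rgb_map]


if __name__ == "__main__":
    labels = [[0, 1], [2, 0]]
    rgb = labels_to_rgb(labels)
    # A round trip through the palette returns the original labels.
    print(rgb_to_labels(rgb) == labels)
```

Decoding by nearest colour is one simple way to map an imperfect RGB prediction back to discrete classes; the paper may use a different rule.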

Keywords

Semantic Segmentation, Modality adaptation, Text-to-Image diffusion models, Domain adaptation

Discipline

Computer Sciences | Graphics and Human Computer Interfaces

Research Areas

Data Science and Engineering; Intelligent Systems and Optimization

Publication

Proceedings of 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024) : Vancouver, Canada, December 10-15

Identifier

https://nips.cc/virtual/2024/poster/96606

Publisher

NeurIPS

City or Country

Vancouver, Canada

Citation

XIA, Ruihao; LIANG, Yu; JIANG, Peng-Tao; ZHANG, Hao; LI, Bo; TANG, Yang; and ZHOU, Pan. Unsupervised modality adaptation with text-to-Image diffusion models for semantic segmentation. (2024). Proceedings of 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024) : Vancouver, Canada, December 10-15.
Available at: https://ink.library.smu.edu.sg/sis_research/9729

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Included in

Graphics and Human Computer Interfaces Commons

Research Collection School Of Computing and Information Systems

Unsupervised modality adaptation with text-to-Image diffusion models for semantic segmentation
