Publication Type
Journal Article
Version
publishedVersion
Publication Date
10-2025
Abstract
The power of visual language models is showcased in visual understanding tasks, where language-guided models achieve impressive flexibility and precision. In this paper, we extend this capability to the challenging domain of image matting by framing it as a soft grounding problem, enabling a single diffusion model to handle diverse objects, textures, and transparencies, all directed by descriptive text prompts. Our method teaches the diffusion model to ground alpha mattes by guiding it through a process of instance-level localization and transparency estimation. First, we introduce an intermediate objective that trains the model to accurately localize semantic components of the matte based on natural language cues, establishing a robust spatial foundation. Building on this, the model progressively refines its transparency estimation abilities, using the learned semantic structure as a prior to enhance the precision of alpha matte predictions. By treating spatial localization and transparency estimation as distinct learning objectives, our approach allows the model to fully leverage the semantic depth of diffusion models, removing the need for rigid visual priors. Extensive experiments highlight our model’s adaptability, precision, and computational efficiency, setting a new benchmark for flexible, text-driven image matting solutions. The code is available at https://github.com/xty435768/TeachDiffusionMatting.
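As a rough illustration of the two-stage idea described in the abstract (text-guided localization first, then transparency refinement), the sketch below is a minimal, hypothetical PyTorch example: the module name MatteDiffusionUNet, the helper two_stage_loss, and the toy noising scheme are assumptions for illustration only, not the authors' method or the code in the linked repository.

```python
# Hypothetical sketch of a two-stage text-conditioned matting objective.
# All names and the noising scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatteDiffusionUNet(nn.Module):
    """Toy stand-in for a text-conditioned diffusion backbone."""
    def __init__(self, text_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
        self.text_proj = nn.Linear(text_dim, 1)

    def forward(self, noisy_matte, image, text_emb):
        x = torch.cat([noisy_matte, image], dim=1)        # condition on the RGB image
        bias = self.text_proj(text_emb)[..., None, None]  # inject the text prompt
        return self.net(x) + bias                         # predicted matte

def two_stage_loss(model, image, text_emb, alpha_gt, stage):
    """Stage 1: coarse localization target (binarized matte).
       Stage 2: full transparency (soft alpha) target."""
    t = torch.rand(image.size(0), device=image.device).view(-1, 1, 1, 1)
    noise = torch.randn_like(alpha_gt)
    target = (alpha_gt > 0.5).float() if stage == 1 else alpha_gt
    noisy = t * noise + (1 - t) * target                   # toy noising of the target
    pred = model(noisy, image, text_emb)
    return F.mse_loss(pred, target)

if __name__ == "__main__":
    model = MatteDiffusionUNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    image = torch.rand(2, 3, 64, 64)      # dummy RGB batch
    text_emb = torch.rand(2, 64)          # dummy prompt embeddings
    alpha_gt = torch.rand(2, 1, 64, 64)   # dummy ground-truth mattes
    for stage in (1, 2):                  # localization first, then transparency
        loss = two_stage_loss(model, image, text_emb, alpha_gt, stage)
        opt.zero_grad(); loss.backward(); opt.step()
```

In this sketch, stage 1 binarizes the matte so the model first learns where the prompted subject is before stage 2 asks it how transparent each pixel is; the paper's actual objectives, conditioning, and diffusion formulation will differ and are given in the article and repository above.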
Discipline
Graphics and Human Computer Interfaces | Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
Transactions on Machine Learning Research
First Page
1
Last Page
29
Publisher
JMLR
Citation
XIANG, Tianyi; ZHENG, Weiying; JIANG, Yutao; SHEN, Tingrui; YU, Hewei; XU, Yangyang; and HE, Shengfeng.
Teaching diffusion models to ground alpha matte. (2025). Transactions on Machine Learning Research. 1-29.
Available at: https://ink.library.smu.edu.sg/sis_research/10514
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://openreview.net/pdf?id=2gNy9Yeg8J