Publication Type

Journal Article

Version

publishedVersion

Publication Date

10-2025

Abstract

The power of visual language models is showcased in visual understanding tasks, where language-guided models achieve impressive flexibility and precision. In this paper, we extend this capability to the challenging domain of image matting by framing it as a soft grounding problem, enabling a single diffusion model to handle diverse objects, textures, and transparencies, all directed by descriptive text prompts. Our method teaches the diffusion model to ground alpha mattes by guiding it through a process of instance-level localization and transparency estimation. First, we introduce an intermediate objective that trains the model to accurately localize semantic components of the matte based on natural language cues, establishing a robust spatial foundation. Building on this, the model progressively refines its transparency estimation abilities, using the learned semantic structure as a prior to enhance the precision of alpha matte predictions. By treating spatial localization and transparency estimation as distinct learning objectives, our approach allows the model to fully leverage the semantic depth of diffusion models, removing the need for rigid visual priors. Extensive experiments highlight our model's adaptability, precision, and computational efficiency, setting a new benchmark for flexible, text-driven image matting solutions. The code is available at https://github.com/xty435768/TeachDiffusionMatting.
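
As a rough illustration of the two-objective idea described in the abstract, the following minimal PyTorch-style sketch alternates a single prediction head between a localization loss and a transparency loss. The function name, batch keys, and loss choices here are illustrative assumptions, not the authors' actual implementation; see the paper and repository for the real training scheme.

import torch
import torch.nn.functional as F

def train_step(model, batch, stage):
    # Hypothetical step: `model` is assumed to map an (image, text prompt)
    # pair to a dense per-pixel logit map of the same spatial size.
    pred = model(batch["image"], batch["prompt"])
    if stage == "localization":
        # Stage 1 (intermediate objective): learn where the described
        # instance is, against a binary mask target.
        loss = F.binary_cross_entropy_with_logits(pred, batch["mask"])
    else:
        # Stage 2: refine a continuous alpha matte in [0, 1], using the
        # spatial structure learned in stage 1 as a starting point.
        loss = F.l1_loss(torch.sigmoid(pred), batch["alpha"])
    loss.backward()
    return loss.item()

In the staged scheme the abstract describes, the localization objective would be trained first, so the learned semantic structure can act as a prior when the model moves on to transparency estimation.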

Discipline

Graphics and Human Computer Interfaces | Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

Transactions on Machine Learning Research

First Page

1

Last Page

29

Publisher

JMLR

Additional URL

https://openreview.net/pdf?id=2gNy9Yeg8J
