Publication Type

PhD Dissertation

Version

publishedVersion

Publication Date

12-2023

Abstract

Semantic segmentation is a fundamental task in computer vision that assigns a label to every pixel in an image based on the semantic meaning of the objects present. Training deep models for this task demands a large number of pixel-level labeled images. Weakly-supervised semantic segmentation (WSSS) is a more feasible approach that learns the segmentation task from weak annotations only. Image-level WSSS, where only the class labels of the whole image are given as supervision, is the most popular and also the most challenging setting. To address this challenge, the Class Activation Map (CAM) has emerged as a powerful technique in WSSS. CAM provides a way to visualize the areas of an image that are most relevant to a particular class without requiring pixel-level annotations. However, CAM is generated from a classification model, and due to the discriminative nature of that model it often highlights only the most discriminative parts of the object.
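To make the CAM mechanism concrete, the minimal sketch below shows the conventional computation (Zhou et al., 2016): the final convolutional feature maps are weighted by the classifier weights of the target class. The tensor shapes and the normalization are illustrative choices, not taken from the dissertation.

```python
import torch

def compute_cam(features, fc_weights, class_idx):
    """Conventional CAM: weight the last conv feature maps by the
    classification-layer weights of the target class.

    features:   (C, H, W) activations from the final conv layer
    fc_weights: (num_classes, C) weights of the classification layer
    class_idx:  index of the target class
    """
    # Weighted sum over channels -> one (H, W) activation map
    cam = torch.einsum('c,chw->hw', fc_weights[class_idx], features)
    cam = torch.relu(cam)            # keep positive evidence only
    return cam / (cam.max() + 1e-8)  # normalize to [0, 1]
```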

This dissertation examines the key issues behind conventional CAM and proposes corresponding solutions. Two of our completed works focus on two crucial steps in CAM generation: training a classification model and computing CAM from that model. The first work discusses the disadvantage of a key component in training a good classification model: the binary cross-entropy (BCE) loss function. We introduce a simple method: reactivating the CAM that has converged under BCE by using the softmax cross-entropy (SCE) loss. Thanks to the contrastive nature of SCE, the pixel responses are disentangled into different classes, and hence less mask ambiguity is expected. In our second completed work, we aim to improve the quality of CAM given a trained classification model. Specifically, we introduce a new computation method for CAM that captures non-discriminative features, expanding the CAM coverage to whole objects. This is achieved by clustering all local features of an object class to derive local prototypes, which represent local semantics such as the “head”, “leg”, and “body” of a “sheep”. Our CAM thus captures all local features of the class without discrimination.
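As a rough illustration of the prototype idea, the single-image sketch below clusters local features with k-means and scores every location against all prototypes, so non-discriminative parts receive activation as well. This is not the dissertation's exact procedure (which derives prototypes from features collected across many images of a class); the prototype count and the cosine-similarity scoring are assumptions.

```python
import torch
from sklearn.cluster import KMeans

def prototype_cam(features, num_prototypes=4):
    """Toy prototype-based CAM: cluster one image's local features into
    prototypes (e.g. "head", "leg", "body"), then score every location
    against all prototypes so the whole object is activated.

    features: (C, H, W) feature maps from a trained classifier
    """
    C, H, W = features.shape
    locs = features.permute(1, 2, 0).reshape(-1, C)   # (H*W, C) local features

    # Derive local prototypes by clustering the local features.
    km = KMeans(n_clusters=num_prototypes, n_init=10)
    km.fit(locs.detach().cpu().numpy())
    protos = torch.tensor(km.cluster_centers_, dtype=features.dtype)  # (K, C)

    # Cosine similarity of every location to every prototype,
    # averaged over prototypes -> one (H, W) map.
    sims = torch.nn.functional.normalize(locs, dim=1) @ \
           torch.nn.functional.normalize(protos, dim=1).T             # (H*W, K)
    cam = torch.relu(sims.mean(dim=1).reshape(H, W))
    return cam / (cam.max() + 1e-8)
```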

Although the two completed works have brought significant improvements to conventional CAM, the improved CAM may still face a bottleneck due to limited training data and the co-occurrence of objects and backgrounds. In this dissertation, we therefore investigate the applicability of recent visual foundation models, such as the Segment Anything Model (SAM), in the context of WSSS. SAM is a recent image segmentation model that exhibits superior performance across various segmentation tasks, and it is remarkable for its capability to interpret diverse prompts and generate the corresponding object masks. We scrutinize SAM in two intriguing scenarios, text prompting and zero-shot learning, and propose related pipelines for its application in WSSS. We provide insights into the potential and challenges of deploying visual foundation models for WSSS, facilitating future developments in this exciting research area.
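The dissertation's actual pipelines are detailed in the thesis itself; purely as a hypothetical illustration of the zero-shot scenario, the sketch below lets SAM's automatic mask generator propose class-agnostic masks and keeps those whose mean CAM response exceeds a threshold. The function name, the threshold, the checkpoint path, and the CAM-overlap rule are assumptions; only the segment_anything calls follow the official SAM API.

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Checkpoint path is illustrative; weights come from the official SAM release.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def sam_masks_to_pseudo_label(image, cam, threshold=0.5):
    """Hypothetical zero-shot pipeline: SAM proposes class-agnostic masks,
    and the (image-level) class is assigned via CAM overlap.

    image: (H, W, 3) uint8 RGB array
    cam:   (H, W) normalized activation map for the target class
    """
    pseudo = np.zeros(image.shape[:2], dtype=bool)
    for m in mask_generator.generate(image):
        seg = m["segmentation"]            # (H, W) boolean mask from SAM
        if cam[seg].mean() > threshold:    # enough class evidence inside mask
            pseudo |= seg
    return pseudo
```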

Keywords

Computer Vision, Semantic Segmentation, Weakly-Supervised Learning

Degree Awarded

PhD in Computer Science

Discipline

Artificial Intelligence and Robotics | Computer Sciences

Supervisor(s)

SUN, Qianru

First Page

1

Last Page

101

Publisher

Singapore Management University

City or Country

Singapore

Copyright Owner and License

Author
