Publication Type

PhD Dissertation

Version

publishedVersion

Publication Date

5-2024

Abstract

In recent years, remarkable progress has been made in Artificial Intelligence (AI), with an increasing focus on integrating AI systems into people’s daily lives. In the context of our diverse world, research attention has shifted towards applying AI to multimodal understanding tasks. This thesis specifically addresses two key modalities, namely, vision and language, and explores Vision-Language Understanding (VLU).

In the past, addressing VLU tasks involved training distinct models from scratch using task-specific data. However, with only limited training data available, such models can easily overfit and fail to generalize. A recent breakthrough is the development of Pre-trained Models (PTMs), which are trained on extensive datasets to acquire universal representations. Leveraging these PTMs for VLU tasks has become a prevalent approach.

The use of PTMs for VLU tasks can be divided into two paradigms: (1) finetuning PTMs with downstream task data, and (2) zero-shot transfer or few-shot learning based on frozen PTMs. However, existing methods under these two paradigms suffer from a few limitations: direct fine-tuning of PTMs may overlook the unique characteristics of the downstream tasks; the zero-shot and few-shot performance of PTMs on some tasks may be poor; and complex VLU tasks may require multiple reasoning skills that a single PTM may not possess.

In this thesis, we aim to address these limitations by optimizing the utilization of PTMs for VLU tasks. Our work can be organized based on whether we leverage fine-tuning or zero-shot/few-shot learning, and whether we adopt a single PTM or a composition of PTMs. When tuning a single PTM, we explore how to incorporate task-specific components to better cater to downstream tasks (Tuning-Single). For VLU tasks where frozen PTMs are not ideal solutions due to poor performance, we investigate using a single frozen PTM to facilitate sub-steps of these tasks (Frozen-Single). We also study how to compose a set of tuned PTMs, each capable of a reasoning skill, to improve performance on these tasks in the low-resource setting (Tuning-Composition). Finally, as VLU tasks may involve multiple skills and multiple reasoning steps, we consider a composition of frozen PTMs and assign reasoning tasks to the appropriate frozen PTMs without requiring any adaptation (Frozen-Composition).

Specifically, in this thesis, we narrow down our scope to two VLU tasks, Hateful Meme Detection (HMD) and Visual Question Answering (VQA). HMD classifies a given multimodal meme as either hateful or not hateful, while VQA aims to answer questions related to a given image. The decision to focus on these two tasks stems from their importance in real-world applications. Furthermore, both tasks present non-trivial challenges that demand innovative solution approaches.

For the HMD task, most existing work has primarily focused on direct fine-tuning of PTMs, treating HMD as a general multimodal classification task and overlooking its unique characteristics. We address this limitation by integrating task-specific components with PTMs and tuning them end-to-end. We propose DisMultiHate, which builds on a PTM but learns to disentangle representations of hate-speech-related target entities in memes to enhance hateful content classification.

Additionally, HMD often requires external background knowledge for meme comprehension, yet no dedicated knowledge bases have been constructed for this purpose. In light of this, we explore leveraging the knowledge in Pre-trained Language Models (PT-LMs). We propose PromptHate, which prompts PT-LMs and utilizes their implicit knowledge for HMD. Since PT-LMs are inherently textual, PromptHate converts images into textual captions with a frozen Pre-trained Vision-Language Model (PT-VLM).
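As a rough, hedged illustration of this caption-then-prompt idea (not the exact implementation in the thesis), the Python sketch below uses two hypothetical stand-ins: caption_image for the frozen PT-VLM captioner and score_label_words for the prompted PT-LM. The prompt template and label words are illustrative assumptions.

```python
# Minimal sketch of a PromptHate-style pipeline (illustrative only).
# `caption_image` and `score_label_words` are hypothetical stand-ins for a
# frozen PT-VLM captioner and a prompted PT-LM; they return dummy values here.

def caption_image(image_path: str) -> str:
    """Stand-in for a frozen PT-VLM that converts a meme image into a caption."""
    return "two people standing in front of a building"  # dummy caption

def score_label_words(prompt: str, label_words: list[str]) -> dict[str, float]:
    """Stand-in for a PT-LM that scores candidate label words at the masked slot."""
    return {word: 1.0 / len(label_words) for word in label_words}  # dummy scores

def classify_meme(image_path: str, meme_text: str) -> str:
    caption = caption_image(image_path)
    # Cast detection as prompting: the PT-LM fills in the masked label word,
    # drawing on its implicit background knowledge.
    prompt = f"{caption} {meme_text} It was [MASK]."
    scores = score_label_words(prompt, ["harmless", "hateful"])
    return "hateful" if scores["hateful"] >= scores["harmless"] else "not hateful"

print(classify_meme("meme.png", "example meme text"))
```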

Although it achieves good detection performance, PromptHate suffers from non-informative captions. Generic image descriptions may lack crucial details, such as race and gender information, that are vital for detecting hateful content. To address this, we propose Pro-Cap, which leverages a frozen PT-VLM to complement PromptHate. Specifically, we prompt a frozen PT-VLM with hateful-content-related questions and use the answers as image captions (termed Pro-Cap), ensuring that the captions contain critical information for hateful content detection.
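To make the probing-caption idea concrete, here is a minimal, hedged sketch; answer_visual_question is a hypothetical stand-in for the frozen PT-VLM queried in question-answering mode, and the probing questions are illustrative examples rather than the exact set used in the thesis.

```python
# Minimal sketch of Pro-Cap-style probing captions (illustrative only).
# `answer_visual_question` is a hypothetical stand-in for a frozen PT-VLM
# answering questions about the image; it returns a dummy answer here.

PROBING_QUESTIONS = [
    "Who is shown in the image?",
    "What is the race of the person in the image?",
    "What is the gender of the person in the image?",
    "What religion is depicted in the image?",
]

def answer_visual_question(image_path: str, question: str) -> str:
    """Stand-in for a frozen PT-VLM queried in visual question answering mode."""
    return "not sure"  # dummy answer

def build_pro_cap(image_path: str) -> str:
    # Concatenate the answers into an information-rich caption that a
    # PromptHate-style detector can consume in place of a generic caption.
    answers = [answer_visual_question(image_path, q) for q in PROBING_QUESTIONS]
    return " ".join(answer for answer in answers if answer)

print(build_pro_cap("meme.png"))
```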

While these methods exhibit commendable performance, they rely heavily on extensive supervised learning, demanding large volumes of annotated data that are costly and time-consuming to obtain. In response, we further introduce Mod-HATE, which harnesses a composition of tuned PTMs, each of which possesses an essential reasoning capability for HMD. To the best of our knowledge, Mod-HATE represents a pioneering exploration of hateful meme detection tailored to the few-shot learning setting.

For VQA, we study the task under the zero-shot transfer setting. Notably, previous zero-shot VQA models overlooked the explicit consideration of the multi-step reasoning chains inherent in VQA. To address this oversight, we introduce a modularized zero-shot network that explicitly decomposes questions into sub-reasoning steps, converts the sub-reasoning tasks into objectives suitable for PTMs, and assigns the tasks to appropriate PTMs without adaptation.
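The following hedged sketch illustrates the general decompose-and-route idea, not the thesis's actual modules or decomposition procedure; the three stand-in functions and the hard-coded three-step chain are assumptions made purely for illustration.

```python
# Minimal sketch of a modularized zero-shot VQA pipeline (illustrative only).
# The three modules are hypothetical stand-ins for frozen PTMs handling
# sub-reasoning steps; the decomposition below is hard-coded for illustration.

from typing import Callable

def detect_objects(image_path: str, query: str) -> list[str]:
    """Stand-in for a frozen open-vocabulary detector (grounding step)."""
    return ["red car"]  # dummy detection

def match_image_text(image_path: str, statements: list[str]) -> str:
    """Stand-in for a frozen image-text matching PTM (verification step)."""
    return statements[0]  # dummy: pretend the first statement matches best

def answer_with_lm(context: str, question: str) -> str:
    """Stand-in for a frozen language model (final answering step)."""
    return "red"  # dummy answer

MODULES: dict[str, Callable] = {
    "ground": detect_objects,
    "verify": match_image_text,
    "answer": answer_with_lm,
}

def zero_shot_vqa(image_path: str, question: str) -> str:
    # Step 1: ground the entities the question refers to.
    entities = MODULES["ground"](image_path, question)
    # Step 2: verify candidate statements against the image.
    context = MODULES["verify"](image_path, [f"There is a {e}." for e in entities])
    # Step 3: let a frozen LM answer given the verified context.
    return MODULES["answer"](context, question)

print(zero_shot_vqa("photo.jpg", "What color is the car?"))
```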

Expanding our investigation, we delve into a specific VQA scenario known as knowledge-based VQA (K-VQA), in which external knowledge, in addition to the image, is indispensable for answering the given questions. Recent approaches have utilized pre-trained large language models (LLMs) as both a knowledge source and a zero-shot QA model for K-VQA. However, these methods do not explicitly show the knowledge needed to answer the questions and thus lack interpretability. To rectify this deficiency, we propose KGENVQA, which first generates knowledge from a frozen LLM and subsequently leverages another frozen LLM for question answering with the incorporation of the generated knowledge.

Finally, we conclude the thesis with a summary of our contributions and a discussion of potential future directions regarding the application of PTMs to VLU.
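As a hedged sketch of the generate-then-answer idea behind KGENVQA (not its exact prompts, models, or pipeline), the code below uses hypothetical stand-ins, including an assumed captioning step to textualize the image, plus generate_knowledge and answer_question for the two frozen LLM calls.

```python
# Minimal sketch of a KGENVQA-style pipeline (illustrative only).
# `caption_image`, `generate_knowledge`, and `answer_question` are hypothetical
# stand-ins for a frozen captioner and two frozen LLM calls; dummy values here.

def caption_image(image_path: str) -> str:
    """Stand-in for a frozen captioner that textualizes the image."""
    return "a plate of sushi on a wooden table"  # dummy caption

def generate_knowledge(caption: str, question: str) -> str:
    """Stand-in for a frozen LLM prompted to state relevant background knowledge."""
    return "Sushi is a Japanese dish made with vinegared rice."  # dummy knowledge

def answer_question(caption: str, knowledge: str, question: str) -> str:
    """Stand-in for a second frozen LLM answering with the generated knowledge."""
    return "Japan"  # dummy answer

def k_vqa(image_path: str, question: str) -> tuple[str, str]:
    caption = caption_image(image_path)
    # The generated knowledge is returned alongside the answer, making the
    # reasoning behind the prediction inspectable.
    knowledge = generate_knowledge(caption, question)
    answer = answer_question(caption, knowledge, question)
    return answer, knowledge

print(k_vqa("food.jpg", "Which country does this dish come from?"))
```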

Keywords

Vision-language understanding, Visual question answering, Hateful meme detection, Pre-trained models

Degree Awarded

PhD in Information Systems

Discipline

Computer Sciences | Programming Languages and Compilers

Supervisor(s)

JIANG, Jing

First Page

1

Last Page

217

Publisher

Singapore Management University

City or Country

Singapore

Copyright Owner and License

Author
