Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

7-2022

Abstract

In recent years, the pre-training-then-fine-tuning paradigm has yielded immense success on a wide spectrum of cross-modal tasks, such as visual question answering (VQA), in which a visual-language (VL) model is first optimized via self-supervised task objectives, e.g., masked language modeling (MLM) and image-text matching (ITM), and then fine-tuned to adapt to the downstream task (e.g., VQA) via a brand-new objective function, e.g., answer prediction. However, the inconsistency between the objective forms not only severely limits the generalization of pre-trained VL models to downstream tasks, but also requires a large amount of labeled data for fine-tuning. To alleviate this problem, we propose an innovative VL fine-tuning paradigm (named Declaration-based Prompt Tuning, abbreviated as DPT), which fine-tunes the model for downstream VQA using the pre-training objectives, boosting the effective adaptation of pre-trained models to the downstream task. Specifically, DPT reformulates the VQA task via (1) textual adaptation, which converts the given questions into declarative sentence form for prompt-tuning, and (2) task adaptation, which optimizes the objective function of the VQA problem in the manner of the pre-training phase. Experimental results on the GQA dataset show that DPT outperforms its fine-tuned counterpart by a large margin in accuracy in both fully-supervised (2.68%) and zero-shot/few-shot (over 31%) settings. All data and code will be made available to facilitate future research.
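
The two adaptation steps named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how questions are converted into declarations, so the regular-expression rules in `to_declaration` are hypothetical, and `score_fn` is an assumed stand-in for a pre-trained VL model scoring a filled-in sentence (for example with an MLM or ITM head). The image input is omitted entirely.

```python
import re
from typing import Callable, Sequence

MASK = "[MASK]"  # placeholder token for the answer slot


def to_declaration(question: str) -> str:
    """Textual adaptation (toy version): turn a question into a
    declarative sentence whose answer slot is a [MASK] token.
    The rewrite rules are hypothetical and cover only two patterns."""
    q = question.strip().rstrip("?").strip().lower()

    # "what <attribute> is <subject>" -> "the <attribute> of <subject> is [MASK]."
    m = re.match(r"^what (\w+) is (.+)$", q)
    if m:
        return f"the {m.group(1)} of {m.group(2)} is {MASK}."

    # "what is <rest>" -> "<rest> is [MASK]."
    m = re.match(r"^what is (.+)$", q)
    if m:
        return f"{m.group(1)} is {MASK}."

    # Fallback: keep the question and append a generic answer slot.
    return f"{q}? the answer is {MASK}."


def pick_answer(
    declaration: str,
    candidates: Sequence[str],
    score_fn: Callable[[str], float],
) -> str:
    """Task adaptation (toy version): fill the [MASK] slot with each
    candidate and return the one that score_fn rates highest.
    score_fn is an assumption standing in for a pre-trained model."""
    return max(candidates, key=lambda c: score_fn(declaration.replace(MASK, c)))


if __name__ == "__main__":
    decl = to_declaration("What color is the truck?")
    print(decl)
    # Dummy scorer used only to make the example runnable.
    print(pick_answer(decl, ["blue", "red"], lambda s: float("red" in s)))
```

In this framing, answering becomes filling a masked slot and ranking the filled sentences, which mirrors the form of the pre-training objectives instead of a separate answer-classification head.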

Keywords

Machine Learning: Multi-modal learning, Computer Vision: Transfer, low-shot, semi- and un- supervised learning, Computer Vision: Vision and language, Natural Language Processing: Question Answering

Discipline

Databases and Information Systems

Research Areas

Data Science and Engineering

Publication

Proceedings of the 2022 International Joint Conference on Artificial Intelligence, Vienna, Austria, July 23-29

First Page

3264

Last Page

3270

Identifier

10.24963/ijcai.2022/453

Publisher

International Joint Conferences on Artificial Intelligence

City or Country

California

Citation

LIU, Yuhang; WEI, Wei; PENG, Daowan; and ZHU, Feida. Declaration-based prompt tuning for visual question answering. (2022). Proceedings of the 2022 International Joint Conference on Artificial Intelligence, Vienna, Austria, July 23-29. 3264-3270.
Available at: https://ink.library.smu.edu.sg/sis_research/7752

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.24963/ijcai.2022/453

Included in

Databases and Information Systems Commons

Research Collection School Of Computing and Information Systems

Declaration-based prompt tuning for visual question answering
