Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
6-2023
Abstract
Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability that is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into N × N blocks and identifies the objects in each block with an object detector widely used in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP, encouraging the model to predict the objects in a given block or regress the block of a given object, e.g., filling “[P]” or “[O]” in the PTP “The block [P] has a [O]”. This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistent and significant improvements across representative cross-modal learning architectures and several benchmarks, e.g., zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for the ViLT [16] baseline and COCO Captioning (+5.3 in CIDEr) for the SOTA BLIP [19] baseline. Moreover, PTP achieves results comparable to object-detector-based methods [8, 23, 45] with much faster inference, since PTP discards its object detector at inference time whereas the latter cannot.
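A minimal sketch of how such position-guided prompts could be constructed from detector output (assuming a 3 × 3 grid and a hypothetical list of (label, center_x, center_y) detections with coordinates normalized to [0, 1]; the exact construction in the paper may differ):

# Hypothetical illustration of building a Position-guided Text Prompt (PTP).
# In practice the detections would come from the object detector used in VLP.

def block_index(cx: float, cy: float, n: int = 3) -> int:
    # Map a normalized box center to one of the N x N grid blocks (0 .. N*N-1).
    col = min(int(cx * n), n - 1)
    row = min(int(cy * n), n - 1)
    return row * n + col

def build_ptp(detections, n: int = 3):
    # Turn detections into fill-in-the-blank prompts like "The block [P] has a [O]".
    return [f"The block {block_index(cx, cy, n)} has a {label}"
            for label, cx, cy in detections]

# Example: a dog whose box center falls in the bottom-right block of a 3 x 3 grid.
print(build_ptp([("dog", 0.9, 0.8)]))  # ['The block 8 has a dog']

During pre-training, either the block token ([P]) or the object token ([O]) in such a prompt can be masked, so the model learns to predict where a given object is or what a given block contains.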
Discipline
Graphics and Human Computer Interfaces | Programming Languages and Compilers
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, June 17-24
First Page
23242
Last Page
23251
ISBN
9798350301304
Identifier
10.1109/CVPR52729.2023.02226
Publisher
IEEE
City or Country
Piscataway, NJ
Citation
WANG, Alex Jinpeng; ZHOU, Pan; SHOU, Mike Zheng; and YAN, Shuicheng.
Position-guided text prompt for vision-language pre-training. (2023). Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, June 17-24. 23242-23251.
Available at: https://ink.library.smu.edu.sg/sis_research/9021
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1109/CVPR52729.2023.02226
Included in
Graphics and Human Computer Interfaces Commons, Programming Languages and Compilers Commons