Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
11-2023
Abstract
The large-scale vision-language pre-trained model, Contrastive Language-Image Pre-training (CLIP), has significantly improved image captioning for scenarios without human-annotated image-caption pairs. Recent advanced CLIP-based image captioning without human annotations follows a text-only training paradigm, i.e., reconstructing text from a shared embedding space. Nevertheless, these approaches are limited by the training/inference gap or by huge storage requirements for text embeddings. Given that it is trivial to obtain images in the real world, we propose CLIP-guided text GAN (CgT-GAN), which incorporates images into the training process to enable the model to "see" the real visual modality. Specifically, we use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus, and a CLIP-based reward to provide semantic guidance. The caption generator is jointly rewarded by a naturalness reward, computed from the GAN's discriminator and measuring how close the caption is to human language, and a semantic guidance reward computed by the CLIP-based reward module. Besides using cosine similarity as the semantic guidance reward (i.e., CLIP-cos), we further introduce a novel semantic guidance reward called CLIP-agg, which aligns the generated caption with a weighted text embedding obtained by attentively aggregating the entire corpus. Experimental results on three subtasks (zero-shot image captioning, ZS-IC; in-domain unpaired image captioning, In-UIC; and cross-domain unpaired image captioning, Cross-UIC) show that CgT-GAN outperforms state-of-the-art methods significantly across all metrics. Code is available at https://github.com/Lihr747/CgtGAN.
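For illustration only, below is a minimal sketch (not the authors' released code; see the GitHub link above for that) of the two semantic-guidance rewards the abstract describes and the joint reward. All names, the temperature `tau`, the mixing weight `lam`, and the assumption that the image embedding acts as the attention query over the corpus are assumptions made for this example.

```python
# Hypothetical sketch of CgT-GAN's reward signals, assuming
# precomputed, CLIP-encoded embeddings as inputs.
import torch
import torch.nn.functional as F


def clip_cos_reward(image_emb: torch.Tensor, caption_emb: torch.Tensor) -> torch.Tensor:
    """CLIP-cos: cosine similarity between the CLIP image embedding
    and the CLIP text embedding of the generated caption."""
    return F.cosine_similarity(image_emb, caption_emb, dim=-1)


def clip_agg_reward(image_emb: torch.Tensor, caption_emb: torch.Tensor,
                    corpus_embs: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """CLIP-agg: attentively aggregate the corpus text embeddings into a
    weighted text embedding (here, using the image embedding as the
    query), then score the caption against that aggregated embedding."""
    image_emb = F.normalize(image_emb, dim=-1)      # (B, D)
    corpus_embs = F.normalize(corpus_embs, dim=-1)  # (N, D)
    attn = torch.softmax(image_emb @ corpus_embs.T / tau, dim=-1)  # (B, N)
    agg_text_emb = attn @ corpus_embs               # (B, D) weighted text embedding
    return F.cosine_similarity(caption_emb, agg_text_emb, dim=-1)


def joint_reward(disc_naturalness: torch.Tensor, semantic_reward: torch.Tensor,
                 lam: float = 1.0) -> torch.Tensor:
    """Joint reward for the caption generator: the discriminator's
    naturalness score plus the CLIP-based semantic guidance, mixed
    by an assumed weight `lam`."""
    return disc_naturalness + lam * semantic_reward
```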
Keywords
Image captioning, CLIP, Reinforcement learning, GAN
Discipline
Graphics and Human Computer Interfaces | Programming Languages and Compilers
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
MM'23: Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, Canada, October 29 - November 3, 2023
First Page
2252
Last Page
2263
Identifier
10.1145/3581783.3611891
Publisher
ACM
City or Country
New York
Citation
YU, Jiarui; LI, Haoran; HAO, Yanbin; ZHU, Bin; XU, Tong; and HE, Xiangnan.
CgT-GAN: CLIP-guided text GAN for image captioning. (2023). MM'23: Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, Canada, October 29 - November 3, 2023. 2252-2263.
Available at: https://ink.library.smu.edu.sg/sis_research/9012
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1145/3581783.3611891