Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

2-2024

Abstract

Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a multimodal prompts scheme and a multimodal adaptation scheme. The former includes a Subcategory-specific Vision Prompt (SsVP) and a Discrepancy-aware Text Prompt (DaTP), which explicitly highlight the subcategory-specific discrepancies from the perspectives of both vision and language. The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and expedite efficient adaptation for FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.

Keywords

Fine-grained visual classification, Categorization, Multimodal prompts, Optimization strategy

Discipline

Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces | Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

Proceedings of the 38th AAAI Conference on Artificial Intelligence, AAAI 2024, Vancouver, February 20-27

Volume

38

First Page

2570

Last Page

2578

ISBN

9781577358879

Identifier

10.1609/aaai.v38i3.28034

Publisher

AAAI

City or Country

Palo Alto, CA

Citation

JIANG, Xin; TANG, Hao; GAO, Junyao; DU, Xiaoyu; HE, Shengfeng; and LI, Zechao. Delving into multimodal prompting for fine-grained visual classification. (2024). Proceedings of the 38th AAAI Conference on Artificial Intelligence, AAAI 2024, Vancouver, February 20-27. 38, 2570-2578.
Available at: https://ink.library.smu.edu.sg/sis_research/8741

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1609/aaai.v38i3.28034

Included in

Artificial Intelligence and Robotics Commons, Graphics and Human Computer Interfaces Commons, Software Engineering Commons

Research Collection School Of Computing and Information Systems

Delving into multimodal prompting for fine-grained visual classification
