Modality-specific interactive attack for vision-language pre-training models
Publication Type
Journal Article
Publication Date
5-2025
Abstract
Recent advances have heightened interest in the adversarial transferability of Vision-Language Pre-training (VLP) models. However, most existing strategies are constrained by two persistent limitations: suboptimal utilization of cross-modal interactive information, and inherent discrepancies across hierarchical textual representations. To address these challenges, we propose the Modality-Specific Interactive Attack (MSI-Attack), a novel approach that integrates semantic-level image perturbations with embedding-level text perturbations while maintaining minimal inter-modal constraints. In our image attack methodology, we introduce Multi-modal Integrated Gradients (MIG) to guide perturbations toward the core semantics of images, enriched by their associated deep textual information. This technique enhances transferability by capturing features that are consistent across various models, thereby effectively misleading the perception areas of similar models. Additionally, we employ a momentum iteration strategy in conjunction with MIG, which amalgamates current and historical gradients to expedite perturbation updates. For text attacks, we streamline the perturbation process by operating exclusively at the embedding level; this reduces semantic gaps across hierarchical structures and significantly enhances the generalizability of adversarial text. Moreover, we delve deeper into how semantic perturbations with varying degrees of similarity affect overall attack effectiveness. Experimental results on image-text retrieval tasks using the multi-modal datasets Flickr30K and MSCOCO underscore the efficacy of MSI-Attack: our method achieves superior performance, setting a new state-of-the-art benchmark, without the need for additional mechanisms.
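The momentum iteration strategy mentioned in the abstract can be illustrated with the standard momentum-iterative update (in the style of MI-FGSM), where an accumulated gradient blends the current gradient with its history before the perturbation step. The sketch below is a generic illustration, not the paper's exact MIG-guided method; the function name, step sizes, and the toy `grad_fn` interface are all hypothetical, and NumPy is assumed.

```python
import numpy as np

def momentum_iterative_update(x, grad_fn, alpha=2/255, mu=1.0, eps=8/255, steps=10):
    """Hypothetical sketch of a momentum-iterative attack update.

    x       : clean input (array of pixel values in [0, 1])
    grad_fn : callable returning the loss gradient w.r.t. the current input
    alpha   : per-step size; mu: momentum decay; eps: L-inf perturbation budget
    """
    x_orig = x.copy()
    g = np.zeros_like(x)  # accumulated (historical) gradient
    for _ in range(steps):
        grad = grad_fn(x)
        # Normalize the current gradient by its L1 norm, then fold it
        # into the momentum accumulator (current + historical gradients).
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        # Step in the sign direction of the accumulated gradient.
        x = x + alpha * np.sign(g)
        # Project back into the eps-ball around the original input,
        # and keep pixels in the valid [0, 1] range.
        x = np.clip(x, x_orig - eps, x_orig + eps)
        x = np.clip(x, 0.0, 1.0)
    return x
```

With a constant toy gradient, the update walks the input to the edge of the eps-ball and the projection keeps it there, which is the intended behavior of the momentum-accelerated iteration.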
Keywords
Modality-Specific Interactive, Multi-modal Adversarial Attack, Multi-modal Integrated Gradients
Discipline
Information Security
Research Areas
Cybersecurity
Publication
IEEE Transactions on Information Forensics and Security
Volume
20
First Page
5663
Last Page
5677
ISSN
1556-6013
Identifier
10.1109/TIFS.2025.3574976
Publisher
Institute of Electrical and Electronics Engineers
Citation
ZHANG, Haiqi; TANG, Hao; SUN, Yanpeng; HE, Shengfeng; and LI, Zechao.
Modality-specific interactive attack for vision-language pre-training models. (2025). IEEE Transactions on Information Forensics and Security. 20, 5663-5677.
Available at: https://ink.library.smu.edu.sg/sis_research/10240
Additional URL
https://doi.org/10.1109/TIFS.2025.3574976