Modality-specific interactive attack for vision-language pre-training models
Publication Type
Journal Article
Publication Date
5-2025
Abstract
Recent advances have heightened interest in the adversarial transferability of Vision-Language Pre-training (VLP) models. However, most existing strategies are constrained by two persistent limitations: suboptimal utilization of cross-modal interactive information, and inherent discrepancies across hierarchical textual representations. To address these challenges, we propose the Modality-Specific Interactive Attack (MSI-Attack), a novel approach that integrates semantic-level image perturbations with embedding-level text perturbations while maintaining minimal inter-modal constraints. In our image attack methodology, we introduce Multi-modal Integrated Gradients (MIG) to guide perturbations toward the core semantics of images, enriched by their associated deep textual information. This technique enhances transferability by capturing features that are consistent across various models, thereby effectively misleading the perception areas of similar models. Additionally, we employ a momentum iteration strategy in conjunction with MIG, which amalgamates current and historical gradients to expedite perturbation updates. For text attacks, we streamline the perturbation process by operating exclusively at the embedding level; this reduces semantic gaps across hierarchical structures and significantly enhances the generalizability of adversarial text. Moreover, we delve deeper into how semantic perturbations with varying degrees of similarity affect overall attack effectiveness. Experimental results on image-text retrieval tasks using the multi-modal datasets Flickr30K and MSCOCO underscore the efficacy of MSI-Attack: our method achieves superior performance, setting a new state-of-the-art benchmark, without the need for additional mechanisms.
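The momentum iteration strategy mentioned in the abstract can be illustrated with the standard momentum-iterative update (in the style of MI-FGSM), where an accumulated gradient blends the current gradient with its history before the perturbation step. The sketch below is a generic illustration, not the paper's exact MIG-guided method; the function name, step sizes, and the toy `grad_fn` interface are all hypothetical, and NumPy is assumed.

```python
import numpy as np

def momentum_iterative_update(x, grad_fn, alpha=2/255, mu=1.0, eps=8/255, steps=10):
    """Hypothetical sketch of a momentum-iterative attack update.

    x       : clean input (array of pixel values in [0, 1])
    grad_fn : callable returning the loss gradient w.r.t. the current input
    alpha   : per-step size; mu: momentum decay; eps: L-inf perturbation budget
    """
    x_orig = x.copy()
    g = np.zeros_like(x)  # accumulated (historical) gradient
    for _ in range(steps):
        grad = grad_fn(x)
        # Normalize the current gradient by its L1 norm, then fold it
        # into the momentum accumulator (current + historical gradients).
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        # Step in the sign direction of the accumulated gradient.
        x = x + alpha * np.sign(g)
        # Project back into the eps-ball around the original input,
        # and keep pixels in the valid [0, 1] range.
        x = np.clip(x, x_orig - eps, x_orig + eps)
        x = np.clip(x, 0.0, 1.0)
    return x
```

With a constant toy gradient, the update walks the input to the edge of the eps-ball and the projection keeps it there, which is the intended behavior of the momentum-accelerated iteration.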
Keywords
Modality-Specific Interactive, Multi-modal Adversarial Attack, Multi-modal Integrated Gradients
Discipline
Information Security
Research Areas
Cybersecurity
Publication
IEEE Transactions on Information Forensics and Security
Volume
20
First Page
5663
Last Page
5677
ISSN
1556-6013
Identifier
10.1109/TIFS.2025.3574976
Publisher
Institute of Electrical and Electronics Engineers
Citation
ZHANG, Haiqi; TANG, Hao; SUN, Yanpeng; HE, Shengfeng; and LI, Zechao.
Modality-specific interactive attack for vision-language pre-training models. (2025). IEEE Transactions on Information Forensics and Security. 20, 5663-5677.
Available at: https://ink.library.smu.edu.sg/sis_research/10240
Additional URL
https://doi.org/10.1109/TIFS.2025.3574976