Cost-effective adversarial attacks against Code LLM with model attention
Publication Type
Journal Article
Publication Date
February 2026
Abstract
Code LLMs (CLLMs) are vulnerable to adversarial attacks, in which semantics-preserving code mutations mislead models into incorrect predictions. To address this, adversarial training has been proposed, which retrains models on adversarial examples generated by attack methods. Among various attack approaches, black-box methods have attracted increasing attention due to their flexibility and broad applicability. However, existing black-box attack methods face two key challenges: 1) vast mutation spaces limit attack efficiency and effectiveness, and 2) resource-intensive model queries constrain scalability. These challenges hinder the practicality of black-box attacks, especially under resource constraints, prompting a critical question: can we enhance the efficiency of existing attack methods without compromising their effectiveness? To answer this, we conduct an empirical study using Explainable AI (XAI) techniques to investigate the differences between adversarial and non-adversarial (failed) examples. After analyzing state-of-the-art attack methods against two CLLMs, we introduce the concept of model attention deviation, which quantifies the difference in the model's focus between unmutated (original) and mutated code. Our findings reveal that adversarial examples exhibit significant attention deviations, and that the direction of deviation critically affects attack success. Building on these insights, we propose ADVSEL, an efficient adversarial attack framework comprising two proxy components: the Attention Proxy Model (APM), which quickly estimates attention deviations to filter out unpromising mutations, and the Deviation Direction Proxy Model (DDPM), which assesses whether an attention shift leads toward an incorrect prediction. By integrating these proxy models with existing attack methods, ADVSEL prioritizes promising mutations and significantly improves attack efficiency. Experimental evaluations across five CLLMs, four downstream tasks, and three attack methods demonstrate that ADVSEL maintains comparable attack success rates (ASR; a slight reduction of 0.62%–0.70%) while substantially reducing model queries (by 34.98%–42.91%) and runtime (by 20.84%–21.45%). Under resource constraints, ADVSEL consistently outperforms the baselines, highlighting its practical advantage for cost-effective adversarial evaluation.
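To make the abstract's two central ideas more concrete, the Python sketch below illustrates (a) one plausible way to quantify attention deviation between original and mutated code, and (b) how two cheap proxy scorers could filter a mutation pool before the expensive target CLLM is queried. All names, the L1 deviation measure, and the threshold are illustrative assumptions; the paper's actual APM and DDPM are learned proxy models, and this is a sketch of the concept, not the authors' implementation.

import numpy as np

def attention_deviation(attn_orig: np.ndarray, attn_mut: np.ndarray) -> float:
    """Quantify how much the model's token-level attention shifts after mutation.

    attn_orig / attn_mut: attention mass per token, aligned to a shared token
    space. An L1 distance between the normalized distributions is used here;
    any divergence measure could play this role.
    """
    p = attn_orig / attn_orig.sum()
    q = attn_mut / attn_mut.sum()
    return float(np.abs(p - q).sum())

def select_promising(mutations, apm_score, ddpm_score, dev_threshold=0.3):
    """Rank candidate mutations with two proxy scorers before any real query.

    apm_score(mut)  -> estimated attention deviation (stands in for the APM).
    ddpm_score(mut) -> estimated probability that the deviation points toward
                       an incorrect prediction (stands in for the DDPM).
    dev_threshold is a hypothetical cutoff; only mutations with a large,
    harmful-direction deviation survive the filter.
    """
    scored = [(m, apm_score(m), ddpm_score(m)) for m in mutations]
    kept = [(m, a * d) for m, a, d in scored if a >= dev_threshold]
    return [m for m, _ in sorted(kept, key=lambda x: x[1], reverse=True)]

In this sketch, only high-ranking mutations would ever be sent to the real CLLM, which mirrors the mechanism by which ADVSEL reduces model queries while keeping the attack success rate largely intact.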
Keywords
Adversarial Attack, Code LLM, Explainability
Discipline
Artificial Intelligence and Robotics | Information Security | Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
IEEE Transactions on Software Engineering
First Page
1
Last Page
19
ISSN
0098-5589
Identifier
10.1109/TSE.2026.3663143
Publisher
Institute of Electrical and Electronics Engineers
Citation
SUN, Weifeng; HUANG, Naiqi; YAN, Meng; HUANG, Li; LIU, Zhangxin; LIU, Xiao; and LO, David.
Cost-effective adversarial attacks against Code LLM with model attention. (2026). IEEE Transactions on Software Engineering. 1-19.
Available at: https://ink.library.smu.edu.sg/sis_research/11038
Additional URL
https://doi.org/10.1109/TSE.2026.3663143