Cost-effective adversarial attacks against Code LLM with model attention

Publication Type

Journal Article

Publication Date

February 2026

Abstract

Code LLMs (CLLMs) are vulnerable to adversarial attacks, where semantically equivalent code mutations mislead models into incorrect predictions. To address this, adversarial training has been proposed, which retrains models on adversarial examples generated by attack methods. Among various attack approaches, black-box methods have attracted increasing attention due to their flexibility and applicability. However, existing black-box attack methods face two key challenges: 1) vast mutation spaces limit attack efficiency and effectiveness, and 2) resource-intensive model queries constrain scalability. These challenges hinder the practicality of black-box attacks, especially under resource constraints, prompting the critical question: Can we enhance the efficiency of existing attack methods without compromising their effectiveness? To answer this, we conduct an empirical study using Explainable AI (XAI) techniques to investigate differences between adversarial and non-adversarial (failed) examples. After analyzing state-of-the-art attack methods against two CLLMs, we introduce the concept of model attention deviation, which quantifies differences in the model’s focus between unmutated (original) and mutated code. Our findings reveal that adversarial examples exhibit significant attention deviations, with the direction of deviation critically affecting attack success. Building on these insights, we propose ADVSEL, an efficient adversarial attack framework comprising two proxy components: the Attention Proxy Model (APM), which quickly estimates attention deviations to filter unpromising mutations, and the Deviation Direction Proxy Model (DDPM), which assesses whether attention shifts lead toward incorrect predictions. By integrating these proxy models with existing attack methods, ADVSEL effectively prioritizes promising mutations, significantly improving attack efficiency. Experimental evaluations across five CLLMs, four downstream tasks, and three attack methods demonstrate that ADVSEL maintains comparable attack success rates (a slight ASR reduction of 0.62%–0.70%) while significantly reducing model queries (by 34.98%–42.91%) and runtime (by 20.84%–21.45%). Under resource constraints, ADVSEL consistently outperforms baselines, highlighting its practical advantage in cost-effective adversarial evaluation.
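The paper's exact formulation of attention deviation is not reproduced here. As a minimal illustrative sketch (assuming deviation is measured as an L1 distance between per-token attention distributions, averaging the last layer's self-attention, and using microsoft/codebert-base as a stand-in CLLM; the authors' metric and models may differ), one could estimate it as follows:

```python
# Hypothetical sketch of "model attention deviation" between an original
# code snippet and a mutated variant. Assumptions: last-layer attention,
# head/query averaging, and an L1 distance; all are stand-ins, not the
# paper's definition.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/codebert-base"  # assumed stand-in CLLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()

def attention_profile(code: str) -> torch.Tensor:
    """Normalized per-token attention mass from the last layer."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Last-layer attentions: (batch, heads, seq, seq). Average over heads,
    # then over query positions, to get how much attention each token receives.
    attn = outputs.attentions[-1].mean(dim=1).mean(dim=1).squeeze(0)
    return attn / attn.sum()

def attention_deviation(original: str, mutated: str) -> float:
    """L1 distance between attention profiles, right-padded to equal length."""
    p, q = attention_profile(original), attention_profile(mutated)
    n = max(p.numel(), q.numel())
    p = F.pad(p, (0, n - p.numel()))
    q = F.pad(q, (0, n - q.numel()))
    return (p - q).abs().sum().item()

# Example: a mutation that preserves semantics but shifts attention.
orig = "def add(a, b):\n    return a + b"
mut = "def add(a, b):\n    tmp = a + b\n    return tmp"
print(attention_deviation(orig, mut))
```

In the framework the abstract describes, a cheap proxy such as APM would produce scores of this kind so that only high-deviation mutations consume queries against the actual target model.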

Keywords

Adversarial Attack, Code LLM, Explainability

Discipline

Artificial Intelligence and Robotics | Information Security | Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

IEEE Transactions on Software Engineering

First Page

1

Last Page

19

ISSN

0098-5589

Identifier

10.1109/TSE.2026.3663143

Publisher

Institute of Electrical and Electronics Engineers

Additional URL

https://doi.org/10.1109/TSE.2026.3663143
