Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
11-2025
Abstract
Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjusts model behavior based on input characteristics. We identify two key properties: Rejection Law (R-Law), which shows that stronger steering is needed for jailbreak inputs opposing the rejection direction, and Harmfulness Law (H-Law), which differentiates adversarial and benign inputs. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD), with adaptive coefficients learned via logistic regression, ensuring robust jailbreak defense while preserving benign input handling. Experiments on LLaMA-3.1, Gemma-2, and Qwen2.5 show that AdaSteer outperforms baseline methods across multiple jailbreak attacks with minimal impact on utility. Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
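Note: The abstract describes input-dependent steering coefficients obtained via logistic regression. The toy Python sketch below only illustrates that general idea on synthetic vectors; it is an assumption-laden illustration, not the authors' released implementation, and names such as rejection_direction and adaptive_coefficient are hypothetical.

```python
# Minimal sketch of adaptive activation steering (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # toy hidden-state dimensionality

# Toy "hidden states": benign inputs sit on one side of a rejection
# direction, jailbreak-like inputs on the other.
rejection_direction = rng.normal(size=d)
rejection_direction /= np.linalg.norm(rejection_direction)
benign = rng.normal(size=(100, d)) + 1.5 * rejection_direction
jailbreak = rng.normal(size=(100, d)) - 1.5 * rejection_direction

# Fit a logistic regressor on the scalar projection onto the rejection
# direction; its predicted probability drives the steering strength.
X = (np.concatenate([benign, jailbreak]) @ rejection_direction).reshape(-1, 1)
y = np.concatenate([np.zeros(100), np.ones(100)])  # 1 = jailbreak-like
clf = LogisticRegression().fit(X, y)

def adaptive_coefficient(hidden_state, max_strength=8.0):
    """Steering strength grows with the probability that the input
    opposes the rejection direction (i.e., looks like a jailbreak)."""
    proj = float(hidden_state @ rejection_direction)
    return max_strength * clf.predict_proba([[proj]])[0, 1]

def steer(hidden_state):
    """Shift the representation along the rejection direction by an
    input-dependent amount; benign inputs receive almost no shift."""
    return hidden_state + adaptive_coefficient(hidden_state) * rejection_direction

# Benign input -> small coefficient; jailbreak-like input -> large one.
print(adaptive_coefficient(benign[0]), adaptive_coefficient(jailbreak[0]))
```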
Discipline
Artificial Intelligence and Robotics
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, November 4-9
First Page
24570
Last Page
24588
Identifier
10.18653/v1/2025.emnlp-main.1248
Publisher
ACL
City or Country
USA
Citation
ZHAO, Weixiang; GUO, Jiahe; HU, Yulin; DENG, Yang; ZHANG, An; SUI, Xingyu; HAN, Xinyang; ZHAO, Yanyan; QIN, Bing; CHUA, Tat-Seng; and LIU, Ting.
AdaSteer: Your aligned LLM is inherently an adaptive jailbreak defender. (2025). Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, November 4-9. 24570-24588.
Available at: https://ink.library.smu.edu.sg/sis_research/10724
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://aclanthology.org/2025.emnlp-main.1248/