Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

11-2024

Abstract

Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback or supervised fine-tuning. While existing defense methods focus on either detecting harmful prompts or reducing the likelihood of harmful responses through various means, defending LLMs against jailbreak attacks based on the inner mechanisms of LLMs remains largely unexplored. In this work, we investigate how LLMs respond to harmful prompts and propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks. Through LED, we reveal that several critical safety layers exist among the early layers of LLMs. We then show that realigning these safety layers (and some selected additional layers) with the decoded safe response from identified toxic layers can significantly improve the alignment of LLMs against jailbreak attacks. Extensive experiments across various LLMs (e.g., Llama2, Mistral) demonstrate the effectiveness of LED, which defends against jailbreak attacks while maintaining performance on benign prompts. Our code is available at https://github.com/ledllm/ledllm.
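The two ideas sketched in the abstract, decoding intermediate layers to see which ones surface harmful content and then realigning only a selected subset of layers, can be illustrated with the hypothetical Python sketch below. It is not the authors' LED implementation (see the linked repository); the model name, probe prompt, and layer indices are illustrative assumptions, and the layer probe uses a standard logit-lens-style projection through the output head.

```python
# Hypothetical sketch of layer probing + selective realignment as described in the
# abstract; not the authors' LED implementation (see the linked repository).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any decoder-only LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def decode_layers(prompt, top_k=5):
    """Project each layer's hidden state through the LM head (logit-lens style)
    to inspect what every intermediate layer would emit as the next token."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    per_layer_tokens = []
    for h in out.hidden_states[1:]:                      # skip the embedding output
        logits = model.lm_head(model.model.norm(h[:, -1, :]))
        top_ids = logits.topk(top_k).indices[0].tolist()
        per_layer_tokens.append(tok.convert_ids_to_tokens(top_ids))
    return per_layer_tokens                              # one top-k token list per layer

# Realign only a chosen subset of layers (e.g., early "safety" layers plus a few
# additional ones), keeping all other parameters frozen; indices are illustrative.
layers_to_edit = {2, 3, 4, 12}
for i, block in enumerate(model.model.layers):
    for p in block.parameters():
        p.requires_grad = i in layers_to_edit

# The partially frozen model can then be fine-tuned on (harmful prompt, safe response)
# pairs with a standard causal-LM loss, which is the realignment step sketched above.
```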

Discipline

Software Engineering

Research Areas

Software and Cyber-Physical Systems

Areas of Excellence

Digital transformation

Publication

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Miami, Florida, November 12-16

First Page

5094

Last Page

5109

Identifier

10.48550/arXiv.2405.18166

City or Country

US

Additional URL

https://doi.org/10.48550/arXiv.2405.18166
