Publication Type
Conference Proceeding Article
Version
submittedVersion
Publication Date
12-2024
Abstract
As Large Language Models (LLMs) have developed rapidly and achieved remarkable success, understanding and rectifying their complex internal mechanisms has become an urgent issue. Recent research has attempted to interpret their behaviors through the lens of inner representations. However, developing practical and efficient methods for applying these representations to general and flexible model editing remains challenging. In this work, we explore how to leverage insights from representation engineering to guide the editing of LLMs by deploying a representation sensor as an editing oracle. We first identify the importance of a robust and reliable sensor during editing, then propose an Adversarial Representation Engineering (ARE) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance. Experiments on multiple tasks demonstrate the effectiveness of ARE in various model editing scenarios. Our code and data are available at https://github.com/Zhang-Yihao/Adversarial-Representation-Engineering.
Discipline
Software Engineering
Research Areas
Software and Cyber-Physical Systems
Areas of Excellence
Digital transformation
Publication
Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, Canada, December 10-15
First Page
1
Last Page
22
Identifier
10.48550/arXiv.2404.13752
City or Country
US
Citation
ZHANG, Yihao; WEI, Zeming; SUN, Jun; and SUN, Meng.
Towards general conceptual model editing via adversarial representation engineering. (2024). Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, Canada, December 10-15. 1-22.
Available at: https://ink.library.smu.edu.sg/sis_research/9833
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.48550/arXiv.2404.13752