Publication Type

Conference Proceeding Article

Version

submittedVersion

Publication Date

12-2024

Abstract

Since the rapid development of Large Language Models (LLMs) has achieved remarkable success, understanding and rectifying their internal complex mechanisms has become an urgent issue. Recent research has attempted to interpret their behaviors through the lens of inner representation. However, developing practical and efficient methods for applying these representations for general and flexible model editing remains challenging. In this work, we explore how to leverage insights from representation engineering to guide the editing of LLMs by deploying a representation sensor as an editing oracle. We first identify the importance of a robust and reliable sensor during editing, then propose an Adversarial Representation Engineering (ARE) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance. Experiments on multiple tasks demonstrate the effectiveness of ARE in various model editing scenarios. Our code and data are available at https://github.com/Zhang-Yihao/Adversarial-Representation-Engineering.

Discipline

Software Engineering

Research Areas

Software and Cyber-Physical Systems

Areas of Excellence

Digital transformation

Publication

Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, Canada, December 10-15

First Page

1

Last Page

22

Identifier

10.48550/arXiv.2404.13752

City or Country

Vancouver, Canada

Citation

ZHANG, Yihao; WEI, Zeming; SUN, Jun; and SUN, Meng. Towards general conceptual model editing via adversarial representation engineering. (2024). Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, Canada, December 10-15. 1-22.
Available at: https://ink.library.smu.edu.sg/sis_research/9833

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.48550/arXiv.2404.13752

Included in

Software Engineering Commons

Research Collection School Of Computing and Information Systems

Towards general conceptual model editing via adversarial representation engineering
