Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

7-2025

Abstract

Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMSCAN, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMSCAN systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM’s ‘brain’ behaves differently when generating harmful or untruthful responses. By analyzing the causal contributions of the LLM’s input tokens and transformer layers, LLMSCAN effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.
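The abstract describes scoring the causal contributions of input tokens and transformer layers to the model's output. The following is a minimal, hypothetical sketch (not the authors' implementation) of one way such layer-level causal scores could be computed: ablate each transformer layer in turn and measure how much the next-token distribution shifts. The model, layer-skipping intervention, and KL-based score are illustrative assumptions; the resulting per-layer scores could serve as features for a lightweight misbehavior detector of the kind the abstract mentions.

```python
# Hypothetical sketch of per-layer causal-contribution scoring (assumption,
# not the paper's code): skip each layer via a forward hook and measure the
# KL divergence between the original and intervened output distributions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """A toy stand-in for an LLM: embedding -> N transformer layers -> vocab head."""
    def __init__(self, vocab=100, d_model=32, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ids):
        h = self.embed(ids)
        for layer in self.layers:
            h = layer(h)
        return self.head(h[:, -1])  # next-token logits at the last position

def layer_causal_scores(model, ids):
    """Return one causal-contribution score per layer: the KL divergence between
    the baseline next-token distribution and the distribution obtained when that
    layer is skipped (its output replaced by its input)."""
    model.eval()
    with torch.no_grad():
        base = F.log_softmax(model(ids), dim=-1)
    scores = []
    for layer in model.layers:
        # Intervention: pass the layer's input straight through, skipping the layer.
        handle = layer.register_forward_hook(lambda mod, inp, out: inp[0])
        with torch.no_grad():
            ablated = F.log_softmax(model(ids), dim=-1)
        handle.remove()
        scores.append(
            F.kl_div(ablated, base, log_target=True, reduction="batchmean").item()
        )
    return scores

if __name__ == "__main__":
    model = TinyLM()
    prompt = torch.randint(0, 100, (1, 16))  # stand-in for a tokenized prompt
    print(layer_causal_scores(model, prompt))
```

In this sketch, larger scores indicate layers whose removal changes the output more, i.e., layers with greater causal influence; a simple classifier trained on such score vectors would play the role of the lightweight detectors described in the abstract.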

Discipline

Software Engineering

Areas of Excellence

Digital transformation

Publication

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada, July 13-19, 2025

First Page

1

Last Page

30

Identifier

10.48550/arXiv.2410.16638

City or Country

Canada

Additional URL

https://doi.org/10.48550/arXiv.2410.16638
