Publication Type
PhD Dissertation
Version
publishedVersion
Publication Date
1-2026
Abstract
While Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) are at the frontier of current advancements in artificial intelligence, demonstrating remarkable capabilities across diverse applications, there are growing concerns about their reliability and security. LLMs remain vulnerable to adversarial attacks through carefully crafted prompts that circumvent safety mechanisms, while MLLMs face additional security challenges stemming from their multimodal nature. Despite considerable efforts in reinforcement learning from human feedback (RLHF) and supervised fine-tuning, existing safeguards have proven inadequate in addressing these critical vulnerabilities. This inadequacy stems from the fact that these models are inherently black boxes that do not explain how and why decisions are made, making their security vulnerabilities more difficult to identify and eliminate. Addressing these security challenges requires understanding the inner safety mechanisms of these models, as such understanding is essential for developing targeted mitigation strategies that can rigorously defend against attacks.
In this dissertation, we focus on understanding the inner safety mechanisms of LLMs and MLLMs and developing systematic ways to mitigate the risk of jailbreak attacks and adversarial visual inputs. Our research ranges from causality analysis for security evaluation to layer-specific defense methods and cross-modal safety alignment.
In the first research work, we propose CASPER, a framework for conducting lightweight causality analysis of LLMs at the token, layer, and neuron levels. We apply our framework to open-source LLMs such as Llama2 and Vicuna and make several discoveries. Based on a layer-level causality analysis, we show that RLHF has the effect of overfitting a model to harmful prompts, which implies that such security can be easily overcome by 'unusual' harmful prompts. As evidence, we propose a jailbreak attack method termed emoji attack that achieves a 100% attack success rate on the red-teaming tasks of the Trojan Detection Competition 2023. Furthermore, we show the existence of one mysterious neuron in both Llama2 and Vicuna that has an unreasonably high causal effect on the output. While we are uncertain why such a neuron exists, we show that it is possible to conduct a "Trojan" attack targeting that particular neuron to completely cripple the LLM, i.e., we can generate transferable suffixes to prompts that frequently make the LLM produce meaningless responses.
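As an illustration of what a layer-level causal effect can mean, the following minimal sketch measures how much a toy model's output changes when one layer is skipped. The skip intervention, the scalar toy layers, and the names `run` and `layer_causal_effects` are assumptions made for illustration; the abstract does not specify how CASPER computes its causal effects.

```python
def run(layers, x, skip=None):
    """Apply each layer in order, optionally skipping the layer at index `skip`."""
    for i, layer in enumerate(layers):
        if i != skip:
            x = layer(x)
    return x


def layer_causal_effects(layers, inputs):
    """Average absolute output change caused by skipping each layer in turn.

    Hypothetical intervention: a layer is "removed" by passing its input
    straight through to the next layer.
    """
    effects = []
    for i in range(len(layers)):
        diffs = [abs(run(layers, x) - run(layers, x, skip=i)) for x in inputs]
        effects.append(sum(diffs) / len(diffs))
    return effects


# Toy "model": three scalar layers standing in for transformer layers.
toy_layers = [
    lambda x: x + 1.0,
    lambda x: x * 3.0,
    lambda x: x + 0.01,
]
toy_inputs = [1.0, 2.0]

effects = layer_causal_effects(toy_layers, toy_inputs)
most_influential = effects.index(max(effects))
print(effects, most_influential)
```

In this toy setting the multiplicative layer dominates the output, which is the kind of disproportionate influence a causality analysis is meant to surface.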
In the second research work, we investigate how LLMs respond to harmful prompts and propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks. While existing defense methods focus on either detecting harmful prompts or post-generation detection, defending LLMs against jailbreak attacks based on the inner mechanisms of LLMs remains largely unexplored. Through LED, we reveal that several critical safety layers exist among the early layers of LLMs. We then show that realigning these safety layers (and some selected additional layers) with the decoded safe response from identified toxic layers can significantly improve the alignment of LLMs against jailbreak attacks. Extensive experiments across various LLMs (e.g., Llama2, Mistral) show that LED effectively defends against jailbreak attacks while maintaining performance on benign prompts.
In the third research work, we address the vulnerability of MLLMs to harmful visual inputs despite their robust textual safety mechanisms. Existing safeguards, typically relying on pre-filtering or post-detection, incur high costs and diminish overall utility. To address this critical vulnerability, we introduce SafeCLIP, a lightweight method that leverages MLLMs' inherent multimodal alignment for zero-shot toxic image detection. By projecting CLIP's discarded CLS token into its text space and matching it with toxic descriptors, SafeCLIP detects harmful content without any architectural changes, adding minimal latency and enabling dynamic safety corrections during inference and fine-tuning. Experiments show that SafeCLIP achieves a 66.9% defense success rate with only a 3.2% false positive rate and 7.2% overhead. In contrast, state-of-the-art methods achieve a 52.9% success rate but have a 10.7% false positive rate and 210% overhead. Our work demonstrates that leveraging inherent multimodal alignment can yield efficient, low-cost MLLM safety.
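The matching step described above can be sketched as a projection followed by a cosine-similarity comparison against descriptor embeddings. This is a minimal sketch under stated assumptions: the projection matrix, the two-dimensional descriptor embeddings, the threshold value, and the function name `detect_toxic` are all hypothetical toy choices, not SafeCLIP's actual parameters or API.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def detect_toxic(cls_token, projection, descriptors, threshold):
    """Project an image CLS embedding into text space and compare it
    against toxic descriptor embeddings.

    Returns (best-matching descriptor, its similarity, flagged?).
    The threshold is an assumed tuning parameter.
    """
    projected = matvec(projection, cls_token)
    scores = {name: cosine(projected, emb) for name, emb in descriptors.items()}
    best = max(scores, key=scores.get)
    return best, scores[best], scores[best] >= threshold


# Toy stand-ins: a 2x3 projection and 2-d "text" embeddings (all hypothetical).
projection = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]]
toxic_descriptors = {"violence": [1.0, 0.0], "self-harm": [0.0, 1.0]}

harmful_cls = [0.9, 0.1, 0.3]   # projects close to the "violence" direction
benign_cls = [0.3, -0.4, 0.9]   # projects away from both descriptors

harmful_result = detect_toxic(harmful_cls, projection, toxic_descriptors, 0.8)
benign_result = detect_toxic(benign_cls, projection, toxic_descriptors, 0.8)
print(harmful_result, benign_result)
```

Because the check reuses an embedding the vision encoder already produces, the only added work in this sketch is one matrix-vector product and a handful of similarity computations, which is consistent with the low overhead the abstract reports.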
In the fourth research work, we introduce Q-MLLM, a novel architecture that integrates two-level vector quantization to create a discrete bottleneck against adversarial attacks while preserving multimodal reasoning capabilities. By discretizing visual representations at both pixel-patch and semantic levels, Q-MLLM blocks attack pathways and bridges the cross-modal safety alignment gap. Experiments demonstrate that Q-MLLM achieves a perfect defense success rate (100%) against jailbreak attacks except in one arguable case, while maintaining competitive performance on multiple utility benchmarks with minimal inference overhead. This work establishes vector quantization as an effective defense mechanism for secure multimodal AI systems without requiring expensive safety-specific fine-tuning or detection overhead.
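To illustrate why a discrete bottleneck can absorb small adversarial perturbations, the sketch below performs a nearest-neighbor codebook lookup, the basic operation of vector quantization. It shows a single quantization level only; the codebook, the vectors, and the function name `quantize` are hypothetical toy values, and the sketch does not reproduce Q-MLLM's two-level design.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`
    under squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


# Hypothetical 4-entry codebook over 2-d features.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

clean_feature = [0.9, 0.1]
perturbed_feature = [0.85, 0.2]  # clean feature plus a small perturbation

clean_index, clean_code = quantize(clean_feature, codebook)
perturbed_index, perturbed_code = quantize(perturbed_feature, codebook)

# Both features snap to the same discrete code, so the perturbation
# never reaches anything downstream of the bottleneck.
print(clean_index, perturbed_index)
```

A perturbation only changes the downstream representation if it is large enough to cross a boundary between codebook cells, which is the intuition behind treating quantization as a defense.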
Keywords
LLM safety, MLLM safety, Causality Analysis
Degree Awarded
PhD in Information Systems
Discipline
Artificial Intelligence and Robotics | Computer Sciences
Supervisor(s)
SUN, Jun
First Page
1
Last Page
140
Publisher
Singapore Management University
City or Country
Singapore
Citation
ZHAO, Wei. Inside out: Improving large model safety. (2026). 1-140.
Available at: https://ink.library.smu.edu.sg/etd_coll/828
Copyright Owner and License
Author
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.