Publication Type

PhD Dissertation

Version

publishedVersion

Publication Date

12-2024

Abstract

The field of software engineering has witnessed a surge in large language models specifically tailored to understand and process code, which we call large language models for code (LLM4Code). The increasing popularity of LLM4Code is inseparable from three key factors: the availability of extensive datasets compiled from diverse data sources, the advancements in deep learning algorithms and computational power that facilitate the training of these powerful models, and the active engagement and collaboration within the research community fostering innovation and the rapid exchange of ideas and methodologies. As evidenced by a series of studies, LLM4Code has been experiencing rapid development and achieving phenomenal success in various facets of software engineering. These powerful models have swiftly evolved from experimental prototypes to practical tools, integrating into the daily workflow of software developers around the globe.

However, we unveil that LLM4Code fails to satisfy many non-functional requirements. For example, LLM4Code may not be robust, is vulnerable to data poisoning attacks, and may leak sensitive information. The dissertation starts with the first systematic literature review on the non-functional properties of LLM4Code, which provides a comprehensive understanding of the current status of LLM4Code. We identify six important non-functional properties of LLM4Code, including robustness, security, privacy, usability, explainability, and e ffi ciency.

Then, this dissertation presents work on evaluating the robustness of LLM4Code. We highlight the naturalness requirements in operating adversarial attacks for LLM4Code. We propose using the masked language modeling capability of LLM4Code to generate adversarial perturbations that look natural to human developers. We design a two-step method to generate adversarial examples for

code. Evaluation results show our method can generate more natural adversarial examples than existing methods, and our method can provide stronger robustness enhancement than baselines.

The second work evaluates the security of LLM4Code, specifically the threats posed by backdoor attacks. In this work, we design a novel data poisoning method that can inject adaptive triggers into the training data of LLM4Code. The triggers are much stealthier than the baseline methods, capable of evading detection by both human developers and state-of-the-art defensive methods. Our evaluation results show that the stealthy backdoor attack can still achieve a high attack success rate.

The third work unveils the memorization phenomena in LLM4Code. As part of the dissertation, we show that simple strategies can make models produce a vast amount of outputs that are exactly memorized from their training data. It is worrying that the memorized contents contain many vulnerable code snippets, code with strong licenses, and code that leaks sensitive information like passwords and API keys. We also conduct a comprehensive empirical study to analyze the factors that affect memorization in LLM4Code.

The fourth work presents the first analysis of the membership information leakage risk of LLM4Code. We design a novel membership inference attack method that takes multiple pieces of information into consideration. The results show that our method achieves state-of-the-art performance in inferring the training data information. We also discuss the risks associated with this kind of attack and explain potential mitigation suggestions.

In the end, we report our latest research on constructing and analyzing the LLM4Code ecosystem. We identify key datasets, models, and users in the ecosystem and quantify their contribution and importance. We categorize LLM4Code model reuse into nine categories. Additionally, we examine documentation and licensing practices. To analyze the rapidly growing LLM4Code, we explore the potential of using LLMs to assist in constructing and analyzing the ecosystem. Based on our findings, we discuss the implications and suggestions to facilitate the healthy growth of LLM4Code.

Degree Awarded

PhD in Computer Science

Discipline

Information Security | Programming Languages and Compilers

Supervisor(s)

LO, David

First Page

Last Page

311

Publisher

Singapore Management University

City or Country

Singapore

Citation

YANG, Zhou. Towards robust, secure, and privacy-aware large language models of code. (2024). 1-311.
Available at: https://ink.library.smu.edu.sg/etd_coll/666

Copyright Owner and License

Author

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Download

Included in

Information Security Commons, Programming Languages and Compilers Commons

COinS

Dissertations and Theses Collection (Open Access)

Towards robust, secure, and privacy-aware large language models of code

Publication Type

Version

Publication Date

Abstract

Degree Awarded

Discipline

Supervisor(s)

First Page

Last Page

Publisher

City or Country

Citation

Copyright Owner and License

Creative Commons License

Included in

Search

Links

Browse

Links

Dissertations and Theses Collection (Open Access)

Towards robust, secure, and privacy-aware large language models of code

Author

Publication Type

Version

Publication Date

Abstract

Degree Awarded

Discipline

Supervisor(s)

First Page

Last Page

Publisher

City or Country

Citation

Copyright Owner and License

Creative Commons License

Included in

Share

Search

Links

Browse

Links