Publication Type

Journal Article

Version

acceptedVersion

Publication Date

12-2024

Abstract

Leveraging large-scale datasets from open-source projects and advances in large language models, recent progress has led to sophisticated code models for key software engineering tasks, such as program repair and code completion. These models are trained on data from various sources, including public open-source projects like GitHub and private, confidential code from companies, raising significant privacy concerns. This paper investigates a crucial but unexplored question: What is the risk of membership information leakage in code models? Membership leakage refers to the vulnerability where an attacker can infer whether a specific data point was part of the training dataset. We present Gotcha, a novel membership inference attack method designed for code models, and evaluate its effectiveness on Java-based datasets. Gotcha simultaneously considers three key factors: model input, model output, and ground truth. Our ablation study confirms that each factor significantly enhances attack performance. Our investigation reveals a troubling finding: membership leakage risk is significantly elevated. While previous methods had accuracy close to random guessing, Gotcha achieves high precision, with a true positive rate of 0.95 and a low false positive rate of 0.10. We also demonstrate that the attacker's knowledge of the victim model (e.g., model architecture and pre-training data) affects attack success. Additionally, modifying decoding strategies can help reduce membership leakage risks. This research highlights the urgent need to better understand the privacy vulnerabilities of code models and develop strong countermeasures against these threats.

Keywords

Membership inference attack, Privacy, Large Language Models for code, Code completion

Discipline

Information Security | Numerical Analysis and Scientific Computing

Research Areas

Cybersecurity; Intelligent Systems and Optimization

Publication

IEEE Transactions on Software Engineering

Volume

50

Issue

12

First Page

3290

Last Page

3306

ISSN

0098-5589

Identifier

10.1109/TSE.2024.3482719

Publisher

Institute of Electrical and Electronics Engineers

Citation

YANG, Zhou; ZHAO, Zhipeng; WANG, Chenyu; SHI, Jieke; KIM, Dongsum; HAN, Donggyun; and LO, David. Gotcha! This model uses my code! Evaluating membership leakage risks in code models. (2024). IEEE Transactions on Software Engineering. 50, (12), 3290-3306.
Available at: https://ink.library.smu.edu.sg/sis_research/9889

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1109/TSE.2024.3482719

Included in

Information Security Commons, Numerical Analysis and Scientific Computing Commons

Research Collection School Of Computing and Information Systems

Gotcha! This model uses my code! Evaluating membership leakage risks in code models
