Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

4-2024

Abstract

The availability of large-scale datasets, advanced architectures, and powerful computational resources have led to effective code models that automate diverse software engineering activities. The datasets usually consist of billions of lines of code from both open-source and private repositories. A code model memorizes and produces source code verbatim, which potentially contains vulnerabilities, sensitive information, or code with strict licenses, leading to potential security and privacy issues.This paper investigates an important problem: to what extent do code models memorize their training data? We conduct an empirical study to explore memorization in large pre-trained code models. Our study highlights that simply extracting 20,000 outputs (each having 512 tokens) from a code model can produce over 40,125 code snippets that are memorized from the training data. To provide a better understanding, we build a taxonomy of memorized contents with 3 categories and 14 subcategories. The results show that the prompts sent to the code models affect the distribution of memorized contents. We identify several key factors of memorization. Specifically, given the same architecture, larger models suffer more from memorization problem. A code model produces more memorization when it is allowed to generate longer outputs. We also find a strong positive correlation between the number of an output's occurrences in the training data and that in the generated outputs, which indicates that a potential way to reduce memorization is to remove duplicates in the training data. We then identify effective metrics that infer whether an output contains memorization accurately. We also make suggestions to deal with memorization.

Keywords

Open-Source Software, Memorization, Code Generation

Discipline

Programming Languages and Compilers | Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, April 14-20

First Page

Last Page

ISBN

9798400702174

Identifier

10.1145/3597503.363907

Publisher

ACM

City or Country

New York

Citation

YANG, Zhou; ZHAO, Zhipeng; WANG, Chenyu; SHI, Jieke; KIM, Dongsun; HAN, DongGyun; and LO, David. Unveiling memorization in code models. (2024). ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, April 14-20. 1-13.
Available at: https://ink.library.smu.edu.sg/sis_research/9246

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1145/3597503.363907

Download

Included in

Programming Languages and Compilers Commons, Software Engineering Commons

COinS

Research Collection School Of Computing and Information Systems

Unveiling memorization in code models

Publication Type

Version

Publication Date

Abstract

Keywords

Discipline

Research Areas

Publication

First Page

Last Page

ISBN

Identifier

Publisher

City or Country

Citation

Creative Commons License

Additional URL

Included in

Search

Links

Browse

Links

Research Collection School Of Computing and Information Systems

Unveiling memorization in code models

Author

Publication Type

Version

Publication Date

Abstract

Keywords

Discipline

Research Areas

Publication

First Page

Last Page

ISBN

Identifier

Publisher

City or Country

Citation

Creative Commons License

Additional URL

Included in

Share

Search

Links

Browse

Links