Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
6-2025
Abstract
Large Language Models (LLMs) have achieved remarkable success in various applications, particularly in code-related tasks such as code generation and program repair, setting new performance benchmarks. However, the extensive use of large training corpora raises concerns about whether these achievements stem from genuine understanding or mere memorization of training data, a question often overlooked in current research. This paper studies the memorization issue in LLM-based program repair by investigating whether the correct patches generated by LLMs are the result of memorization. The key challenge lies in the absence of ground truth for confirming memorization, which has led to various ad hoc detection methods. To address this challenge, we first propose a general framework that formalizes memorization detection as a hypothesis testing problem, in which existing approaches can be unified by defining a low-probability event under the null hypothesis that the data is not memorized. The occurrence of such an event leads to the rejection of the null hypothesis, indicating potential memorization. Based on this framework, we design two specific methods (i.e., low-probability events) to detect potential memorization: 1) basic ground-truth matching, and 2) reassessment after substantial code mutation. We investigate the memorization issue in LLM-based program repair using two datasets: Defects4J, a widely used benchmark that is likely included in the training data, and GitBug-Java, a new dataset that is unlikely to be part of the training data. Our findings reveal that a significant portion of correct patches exactly match the ground truths in Defects4J (e.g., 78.83% and 87.42% on GPT-3.5 and CodeLlama-7b, respectively). Moreover, even after significant modifications to the buggy code, where the original repairs should no longer be generated, a considerable percentage of bugs (e.g., 81.82% on GPT-3.5 and 88.24% on CodeLlama-7b) continue to be fixed exactly as in the original bug fixes, indicating a high likelihood of memorization. Furthermore, we evaluate existing memorization detection methods and demonstrate their ineffectiveness in this context (e.g., most AUROCs are below 0.5); theoretical analysis under our hypothesis testing framework shows that the events they define may not satisfy the low-probability requirement. The study highlights the critical need for more robust and rigorous evaluations in LLM-based software engineering research, ensuring a clear distinction between true problem-solving capabilities and mere memorization.
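The abstract's central framing, memorization detection as a hypothesis test whose null hypothesis is "the patch was not memorized", can be illustrated with a minimal sketch. The code below is hypothetical (function names, the normalization rules, and the example patches are assumptions for illustration, not the paper's implementation); it only shows how the two detection events described above, exact ground-truth matching and re-checking after code mutation, reduce to simple predicates whose occurrence rejects the null hypothesis.

```python
# Illustrative sketch only; names and normalization rules are assumptions,
# not the authors' implementation.
import re

def normalize(code: str) -> str:
    """Canonicalize a patch so trivial formatting differences do not count."""
    code = re.sub(r"//.*", "", code)                    # drop Java line comments
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.S)   # drop block comments
    return re.sub(r"\s+", " ", code).strip()            # collapse whitespace

def exact_match_event(generated_patch: str, ground_truth_patch: str) -> bool:
    """Event E1 (basic ground-truth matching): the generated patch reproduces
    the developer fix verbatim up to formatting. Under H0 ("not memorized"),
    independently rediscovering the exact fix is assumed to be low-probability,
    so observing E1 rejects H0 and flags potential memorization."""
    return normalize(generated_patch) == normalize(ground_truth_patch)

def mutation_event(patch_after_mutation: str, original_ground_truth: str) -> bool:
    """Event E2 (reassessment after code mutation): after the buggy code has
    been substantially mutated, so the original fix should no longer apply,
    the model still emits the original developer fix verbatim."""
    return normalize(patch_after_mutation) == normalize(original_ground_truth)

if __name__ == "__main__":
    dev_fix = "if (index >= 0 && index < size) { return data[index]; }"
    llm_fix = "if (index >= 0 && index < size) {\n    return data[index];\n}"
    print("E1 occurred, reject H0:", exact_match_event(llm_fix, dev_fix))
```

In this framing, any existing detection method corresponds to some event of this kind; the paper's theoretical analysis argues that a method is only sound when its event is genuinely low-probability under the null hypothesis.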
Keywords
Program Repair, Code Memorization
Discipline
Artificial Intelligence and Robotics | Software Engineering
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
Proceedings of the ACM on Software Engineering, Volume 2, Issue FSE, Trondheim, Norway, 2025 June 23-27
First Page
2712
Last Page
2734
Identifier
10.1145/3729390
Publisher
ACM
City or Country
New York
Citation
KONG, Jiaolong; XIE, Xiaofei; and LIU, Shangqing.
Demystifying memorization in LLM-based program repair via a general hypothesis testing framework. (2025). Proceedings of the ACM on Software Engineering, Volume 2, Issue FSE, Trondheim, Norway, 2025 June 23-27. 2712-2734.
Available at: https://ink.library.smu.edu.sg/sis_research/10323
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://dl.acm.org/doi/10.1145/3729390