Publication Type

Journal Article

Version

acceptedVersion

Publication Date

7-2025

Abstract

Project-specific code completion, which aims to complete code based on the context of the project, is an important and practical software engineering task. The state-of-the-art approaches employ the retrieval-augmented generation (RAG) paradigm and prompt large language models (LLMs) with information retrieved from the target project for project-specific code completion. In practice, developers always define and use custom functionalities, namely internal APIs, to facilitate the implementation of specific project requirements. Thus, it is essential to consider internal API information for accurate project-specific code completion. However, existing approaches either retrieve similar code snippets, which do not necessarily contain related internal API information, or retrieve internal API information based on import statements, which usually do not exist when the related internal APIs haven’t been used in the file. Therefore, these project-specific code completion approaches face challenges in effectiveness or practicability. To this end, this paper aims to enhance project-specific code completion by locating internal API information without relying on import statements. We first propose a method to infer internal API information. Our method first extends the representation of each internal API by constructing its usage examples and functional semantic information (i.e., a natural language description of the function’s purpose) and constructs a knowledge base. Based on the knowledge base, our method uses an initial completion solution generated by LLMs to infer the API information necessary for completion. Based on this method, we propose a code completion approach that enhances project-specific code completion by integrating similar code snippets and internal API information. Furthermore, we developed a benchmark named ProjBench, which consists of recent, large-scale real-world projects and is free of leaked import statements. 
We evaluated the effectiveness of our approach on ProjBench and an existing benchmark, CrossCodeEval. Experimental results show that our approach outperforms the best-performing approach by an average of +5.91 in code exact match and +6.26 in identifier exact match, corresponding to relative improvements of 22.72% and 18.31%, respectively. We also show that our method complements existing ones by integrating it into various baselines, boosting code match by +7.77 (47.80%) and identifier match by +8.50 (35.55%) on average.
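
The inference step described in the abstract (matching an initial LLM-generated completion against a knowledge base of internal APIs enriched with usage examples and descriptions) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ApiEntry` fields, the sample APIs, the hard-coded draft, and the token-overlap score, which stands in for whatever retrieval method the paper actually uses.

```python
import re
from dataclasses import dataclass


@dataclass
class ApiEntry:
    """Hypothetical knowledge-base record for one internal API."""
    name: str          # internal API identifier
    usage: str         # constructed usage example
    description: str   # natural-language summary of the function's purpose


def tokens(text: str) -> set[str]:
    """Lowercase word pieces of all identifiers, splitting snake_case and camelCase."""
    pieces = set()
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        for part in re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", ident):
            if part:
                pieces.add(part.lower())
    return pieces


def infer_apis(draft: str, kb: list[ApiEntry], k: int = 1) -> list[str]:
    """Rank knowledge-base entries by Jaccard overlap with the draft completion."""
    draft_tokens = tokens(draft)

    def score(entry: ApiEntry) -> float:
        # Extended representation: name + usage example + description.
        entry_tokens = tokens(f"{entry.name} {entry.usage} {entry.description}")
        union = draft_tokens | entry_tokens
        return len(draft_tokens & entry_tokens) / len(union) if union else 0.0

    ranked = sorted(kb, key=score, reverse=True)
    return [entry.name for entry in ranked[:k]]


# Illustrative data only; these APIs do not come from the paper.
kb = [
    ApiEntry("load_user_config", "cfg = load_user_config(path)",
             "Read the user configuration file and return a dict."),
    ApiEntry("render_report", "html = render_report(rows)",
             "Render table rows into an HTML report."),
]

# A draft completion that guesses a non-existent API name (assumed LLM output).
draft = "config = read_config(config_path)"

print(infer_apis(draft, kb))
```

Note that the draft never names `load_user_config` and the file needs no import of it; the description and usage example supply the overlapping vocabulary, which is the motivation the abstract gives for extending each API's representation.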

Keywords

Codes, Benchmark Testing, Semantics, Large Language Models, Knowledge Based Systems, Training, Retrieval Augmented Generation, Data Mining

Discipline

Software Engineering

Research Areas

Software and Cyber-Physical Systems

Areas of Excellence

Digital transformation

Publication

IEEE Transactions on Software Engineering

Volume

51

Issue

9

First Page

2566

Last Page

2582

ISSN

0098-5589

Identifier

10.1109/TSE.2025.3592823

Publisher

Institute of Electrical and Electronics Engineers

Additional URL

https://doi.org/10.1109/TSE.2025.3592823
