Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
11-2022
Abstract
Due to the emergence of large-scale codebases, such as GitHub and Gitee, searching and reusing existing code can help developers substantially improve software development productivity. Over the years, many code search tools have been developed. Early tools leveraged information retrieval (IR) techniques to perform efficient code search over frequently changing large-scale codebases. However, their search accuracy was low due to the semantic mismatch between query and code. In recent years, many tools have leveraged deep learning (DL) techniques to address this issue, but DL-based tools are slow and their search accuracy is unstable. In this paper, we present an IR-based tool, CodeMatcher, which inherits the advantages of DL-based tools in query semantics matching. CodeMatcher first builds an index for a large-scale codebase to reduce search response time. For a given search query, it handles irrelevant and noisy words in the query, then retrieves candidate code from the indexed codebase via iterative fuzzy search, and finally reranks the candidates based on two designed measures of semantic matching between query and candidates. We implemented CodeMatcher as a search engine website. To verify the effectiveness of our tool, we evaluated CodeMatcher on 41k+ open-source Java repositories. Experimental results showed that CodeMatcher achieves an industrial-level response time (0.3s) on a common server with an Intel i7 CPU. In terms of search accuracy, CodeMatcher significantly outperforms three state-of-the-art tools (DeepCS, UNIF, and CodeHow) and two online search engines (GitHub search and Google search).
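The pipeline outlined in the abstract (index the codebase, clean the query, retrieve candidates by iterative fuzzy search, rerank) can be illustrated with a toy sketch. This is not CodeMatcher's implementation: the stop-word list, the three-method codebase, the rule for relaxing the query, and the two scoring measures below are all assumptions made for illustration, and every function name is hypothetical.

```python
import re
from collections import defaultdict

# Assumed stop-word list; the tool's real handling of noisy words is not given here.
STOP_WORDS = {"how", "to", "a", "an", "the", "in", "into", "of", "for", "with", "do", "i"}

# Hypothetical toy codebase: Java method name -> method body.
CODEBASE = {
    "readFileToString": "return new String(Files.readAllBytes(Paths.get(path)));",
    "writeStringToFile": "Files.write(Paths.get(path), text.getBytes());",
    "convertStringToInt": "return Integer.parseInt(text);",
}

def tokenize(text):
    """Split prose or camelCase identifiers into lowercase word tokens."""
    return [t.lower() for t in re.findall(r"[A-Za-z][a-z0-9]*", text)]

def build_index(codebase):
    """Inverted index: method-name token -> set of method names."""
    index = defaultdict(set)
    for name in codebase:
        for token in tokenize(name):
            index[token].add(name)
    return index

def clean_query(query):
    """Drop stop words so only content-bearing tokens remain."""
    return [t for t in tokenize(query) if t not in STOP_WORDS]

def fuzzy_search(index, tokens):
    """Require all tokens first; if nothing matches, drop the token
    with the fewest index hits and retry (assumed relaxation rule)."""
    remaining = list(tokens)
    while remaining:
        hits = set.intersection(*(index.get(t, set()) for t in remaining))
        if hits:
            return hits
        remaining.remove(min(remaining, key=lambda t: len(index.get(t, ()))))
    return set()

def rerank(codebase, tokens, candidates):
    """Order candidates by two assumed measures: the fraction of query
    tokens found in the method name, then the fraction found in the body."""
    def key(name):
        name_tokens = set(tokenize(name))
        body_tokens = set(tokenize(codebase[name]))
        name_match = sum(t in name_tokens for t in tokens) / len(tokens)
        body_match = sum(t in body_tokens for t in tokens) / len(tokens)
        return (-name_match, -body_match, name)  # name breaks ties deterministically
    return sorted(candidates, key=key)

def search(codebase, index, query):
    tokens = clean_query(query)
    if not tokens:
        return []
    return rerank(codebase, tokens, fuzzy_search(index, tokens))

INDEX = build_index(CODEBASE)
print(search(CODEBASE, INDEX, "How to read a file into a string"))
```

In this sketch the index is built once up front, so each query only intersects small sets rather than scanning every method, which mirrors the abstract's point that indexing is what keeps response time low.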
Discipline
Databases and Information Systems | Programming Languages and Compilers | Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
ESEC/FSE '22: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore, Singapore, November 14-18
First Page
1642
Last Page
1646
ISBN
9781450394130
Identifier
10.1145/3540250.3558935
Publisher
ACM
City or Country
New York
Citation
LIU, Chao; BAO, Xuanlin; XIA, Xin; YAN, Meng; LO, David; and ZHANG, Ting.
CodeMatcher: A tool for large-scale code search based on query semantics matching. (2022). ESEC/FSE '22: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore, Singapore, November 14-18. 1642-1646.
Available at: https://ink.library.smu.edu.sg/sis_research/7728
Copyright Owner and License
Publisher
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1145/3540250.3558935
Included in
Databases and Information Systems Commons, Programming Languages and Compilers Commons, Software Engineering Commons