Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
10-2022
Abstract
Large pre-trained models have dramatically improved the state of the art on a variety of natural language processing (NLP) tasks. CodeBERT is one such pre-trained model for natural language (NL) and programming language (PL); it captures the semantics of both and produces general-purpose representations. While it has been shown to support natural language code search and code documentation generation tasks, its effectiveness for code clone detection has not been explored in depth. In this paper, we aim to replicate and evaluate the performance of CodeBERT for code clone detection on multiple datasets with varying functionalities to understand (1) whether CodeBERT can generalize to unseen code, (2) how fine-tuning can affect CodeBERT’s performance on unseen code, and (3) how CodeBERT performs in detecting various code clone types. To this end, we consider three different datasets of Java methods. We derive the first dataset from BigCloneBench. We use Java clone pairs from SemanticCloneBench to derive our second dataset, and our third dataset consists of Java methods from Android applications. Our experiments indicate that CodeBERT performs best at detecting Type-1 and Type-4 clones, with 100% and 96% recall on average, respectively. We also find that generalizability to unseen functionalities is limited: recall drops by 15% and 40% on the SemanticCloneBench and Android datasets, respectively. Furthermore, we observe that fine-tuning can improve recall by 22% and 30% on the SemanticCloneBench and Android datasets, respectively.
Keywords
Code Clone Detection, Semantic Code Clones, Deep-learning, CodeBERT, BigCloneBench, SemanticCloneBench, Android
Discipline
Software Engineering
Publication
Proceedings of the 2022 IEEE 16th International Workshop on Software Clones (IWSC), Limassol, Cyprus, October 2
Last Page
39
ISBN
9781665484473
Identifier
10.1109/IWSC55060.2022.00015
Publisher
IEEE
City or Country
Los Alamitos, CA
Citation
ARSHAD, Saad; ABID, Shamsa; and SHAMAIL, Shafay.
CodeBERT for code clone detection: A replication study. (2022). Proceedings of the 2022 IEEE 16th International Workshop on Software Clones (IWSC), Limassol, Cyprus, October 2. 39.
Available at: https://ink.library.smu.edu.sg/sis_research/10175
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1109/IWSC55060.2022.00015