Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
12-2023
Abstract
Accurate detection of semantic code clones has many applications in software engineering but is challenging because of lexical, syntactic, or structural dissimilarities in code. CodeBERT, a popular deep neural network based pre-trained code model, can detect code clones with high accuracy. However, its performance on unseen data is reported to be lower. A challenge is to interpret CodeBERT's clone detection behavior and isolate the causes of mispredictions. In this paper, we evaluate CodeBERT and interpret its clone detection behavior on the SemanticCloneBench dataset, focusing on Java and Python clone pairs. We introduce the use of a black-box model interpretation technique, SHAP, to identify the core features of code that CodeBERT pays attention to for clone prediction. We first perform a manual similarity analysis over a sample of clone pairs to revise clone labels and to assign labels to statements indicating their contribution to core functionality. We then evaluate the correlation between the human's and the model's interpretations of core features of code as a measure of CodeBERT's trustworthiness. We observe only a weak correlation. Finally, we present examples of how to identify causes of mispredictions for CodeBERT. Our technique can help researchers to assess and fine-tune their models' performance.
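The comparison between human statement labels and model attributions described in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which correlation coefficient or aggregation the authors use, so Spearman rank correlation and the summing of absolute token attributions per statement are choices made here, and all data values and helper names (`statement_scores`, `spearman`, `rank`) are hypothetical. The token attributions stand in for values that an explainer such as SHAP would produce.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = rank(xs), rank(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant input
    return cov / (var_x * var_y) ** 0.5


def statement_scores(token_attributions, token_to_statement, n_statements):
    """Aggregate token attributions to statements by summing magnitudes.

    Assumption: absolute values are summed so that tokens pushing the
    prediction in either direction count as important.
    """
    scores = [0.0] * n_statements
    for value, stmt in zip(token_attributions, token_to_statement):
        scores[stmt] += abs(value)
    return scores


# Hypothetical example: four statements, seven tokens.
human_labels = [1, 0, 1, 0]  # 1 = statement judged core to functionality
token_attr = [0.30, -0.10, 0.05, 0.02, 0.25, 0.20, -0.01]  # stand-in attributions
token_stmt = [0, 0, 1, 1, 2, 2, 3]  # statement index of each token

model_scores = statement_scores(token_attr, token_stmt, len(human_labels))
rho = spearman(human_labels, model_scores)
print(f"Spearman correlation: {rho:.3f}")
```

In this toy input the model's highest-scoring statements coincide with the human-labeled core statements, so the coefficient is high; the abstract reports that on real clone pairs the observed correlation was only weak.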
Keywords
Codes, Correlation, Semantics, Cloning, Predictive models, Syntactics, Software reliability, Explainable AI, Model Interpretation, Black-box, Semantic Clone Detection, Code Model, Deep Learning
Discipline
Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
2023 30th Asia-Pacific Software Engineering Conference (APSEC): Seoul, December 4-7: Proceedings
First Page
229
Last Page
238
ISBN
9798350344172
Identifier
10.1109/APSEC60848.2023.00033
Publisher
IEEE
City or Country
Piscataway
Citation
ABID, Shamsa; CAI, Xuemeng; and JIANG, Lingxiao.
Interpreting CodeBERT for semantic code clone detection. (2023). 2023 30th Asia-Pacific Software Engineering Conference (APSEC): Seoul, December 4-7: Proceedings. 229-238.
Available at: https://ink.library.smu.edu.sg/sis_research/9313
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1109/APSEC60848.2023.00033