Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
5-2007
Abstract
Detecting code clones has many software engineering applications. Existing approaches either do not scale to large code bases or are not robust against minor code modifications. In this paper, we present an efficient algorithm for identifying similar subtrees and apply it to tree representations of source code. Our algorithm is based on a novel characterization of subtrees with numerical vectors in the Euclidean space R^n and an efficient algorithm to cluster these vectors w.r.t. the Euclidean distance metric. Subtrees with vectors in one cluster are considered similar. We have implemented our tree similarity algorithm as a clone detection tool called DECKARD and evaluated it on large code bases written in C and Java including the Linux kernel and JDK. Our experiments show that DECKARD is both scalable and accurate. It is also language independent, applicable to any language with a formally specified grammar.
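The idea in the abstract (map each subtree to a numerical vector, then treat subtrees whose vectors are close in Euclidean distance as similar) can be illustrated with a minimal Python sketch. This is not DECKARD's implementation. It makes several assumptions that the abstract does not state: vectors are built by counting node types in a subtree, only function-level subtrees are compared, Python's standard `ast` module stands in for the parse trees of C or Java, and a brute-force pairwise comparison replaces the efficient clustering algorithm the paper describes. The names `characteristic_vector`, `similar_pairs`, `min_size`, and `max_dist` are hypothetical.

```python
import ast
import math
from collections import Counter


def characteristic_vector(node):
    """Assumed vector form: count of each AST node type in the subtree."""
    return Counter(type(n).__name__ for n in ast.walk(node))


def euclidean(u, v):
    """Euclidean distance between two sparse count vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys))


def similar_pairs(source, min_size=5, max_dist=1.0):
    """Return name pairs of functions whose vectors lie within max_dist.

    Brute-force O(n^2) comparison, used here only for illustration; it is
    a stand-in for the efficient clustering step mentioned in the abstract.
    """
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    vecs = [(f.name, characteristic_vector(f)) for f in funcs]
    # Skip tiny subtrees, which would match each other trivially.
    vecs = [(name, v) for name, v in vecs if sum(v.values()) >= min_size]
    pairs = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if euclidean(vecs[i][1], vecs[j][1]) <= max_dist:
                pairs.append((vecs[i][0], vecs[j][0]))
    return pairs


SAMPLE = '''
def total(xs):
    s = 0
    for x in xs:
        s = s + x
    return s

def accumulate(items):
    acc = 0
    for item in items:
        acc = acc + item
    return acc

def greet(name):
    print("hello", name)
    print("bye", name)
    return None
'''

if __name__ == "__main__":
    print(similar_pairs(SAMPLE))
```

Because the vectors record only node-type counts, renaming identifiers (as in `total` versus `accumulate`) leaves the vector unchanged, which is one way such a representation can be robust against minor code modifications.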
Keywords
detection of code clones, software engineering applications, efficient algorithm
Discipline
Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
ICSE 2007: Proceedings of the 29th International Conference on Software Engineering: Minneapolis, 20-26 May 2007
First Page
96
Last Page
105
ISBN
9780769528281
Identifier
10.1109/ICSE.2007.30
Publisher
IEEE Computer Society
City or Country
Los Alamitos, CA
Citation
JIANG, Lingxiao; MISHERGHI, Ghassan; SU, Zhendong; and GLONDU, Stephane.
DECKARD: Scalable and accurate tree-based detection of code clones. (2007). ICSE 2007: Proceedings of the 29th International Conference on Software Engineering: Minneapolis, 20-26 May 2007. 96-105.
Available at: https://ink.library.smu.edu.sg/sis_research/1011
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
http://doi.ieeecomputersociety.org/10.1109/ICSE.2007.30