Research Collection School Of Computing and Information Systems

InferCode: Self-supervised learning of code representations by predicting subtrees

Duy Quoc Nghi BUI, Singapore Management University
Yijun YU, Open University
Lingxiao JIANG, Singapore Management University

Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

5-2021

Abstract

Learning code representations has found many uses in software engineering, such as code classification, code search, code comment generation, and bug prediction. Although representations of code as tokens, syntax trees, dependency graphs, paths in trees, or combinations of their variants have been proposed, existing learning techniques share a major limitation: the models are often trained on datasets labeled for specific downstream tasks, so the resulting code representations may not be suitable for other tasks. Even though some techniques generate representations from unlabeled code, they are far from satisfactory when applied to downstream tasks. To overcome this limitation, this paper proposes InferCode, which adapts the self-supervised learning idea from natural language processing to the abstract syntax trees (ASTs) of code. The key novelty lies in training code representations by predicting subtrees automatically identified from the context of ASTs. With InferCode, subtrees in ASTs are treated as the labels for training the code representations, without any human labeling effort or the overhead of expensive graph construction, and the trained representations are no longer tied to any specific downstream task or code unit. We have trained an instance of the InferCode model on a large set of Java code, using a Tree-Based Convolutional Neural Network (TBCNN) as the encoder. This pre-trained model can then be applied to downstream unsupervised tasks such as code clustering, code clone detection, and cross-language code search, or be reused under a transfer learning scheme to continue training the model weights for supervised tasks such as code classification and method name prediction. Compared with prior techniques applied to the same downstream tasks, such as code2vec, code2seq, and ASTNN, our pre-trained InferCode model achieves higher performance by a significant margin on most of the tasks, including those involving different programming languages. The implementation of InferCode and the trained embeddings are made available at the anonymous link: https://github.com/ICSE21/infercode.
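
The idea of treating AST subtrees as training labels can be illustrated with a minimal sketch. It is not the authors' implementation: it uses Python's standard `ast` module as a stand-in for the Java parsing described in the abstract, and the helper names (`subtree_signature`, `subtree_labels`), the choice of statement and expression nodes as subtree roots, and the decision to drop identifiers from the labels are all assumptions made for illustration. The encoder (TBCNN) and the prediction step are omitted.

```python
import ast
from collections import Counter


def subtree_signature(node: ast.AST) -> str:
    """Serialize a subtree using node-type names only.

    Assumption for this sketch: identifiers and literal values are dropped,
    so two subtrees with the same shape map to the same label.
    """
    children = [subtree_signature(child) for child in ast.iter_child_nodes(node)]
    name = type(node).__name__
    return f"{name}({','.join(children)})" if children else name


def subtree_labels(source: str, root_types=(ast.stmt, ast.expr)) -> Counter:
    """Count the signature of every statement/expression subtree in `source`.

    These counts play the role of automatically derived labels: no human
    annotation is needed, only a parser.
    """
    tree = ast.parse(source)
    labels = Counter()
    for node in ast.walk(tree):
        if isinstance(node, root_types):
            labels[subtree_signature(node)] += 1
    return labels


# Hypothetical input snippet; any parseable source works.
snippet = "def add(a, b):\n    return a + b\n"
labels = subtree_labels(snippet)

for signature, count in labels.items():
    print(count, signature)
```

In a self-supervised setup along these lines, an encoder would embed the whole snippet and be trained to predict which of these subtree labels occur in it, so the labels come from the code itself rather than from a task-specific dataset.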

Keywords

code search, self supervised, code clone detection, cross language, fine tuning, code retrieval, unlabel data, unlabelled data

Discipline

Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

2021 43rd International Conference on Software Engineering (ICSE): Virtual, May 25-28: Proceedings

First Page

1186

Last Page

1197

ISBN

9781665402965

Identifier

10.1109/ICSE43902.2021.00109

Publisher

IEEE

City or Country

Piscataway, NJ

Citation

BUI, Duy Quoc Nghi; YU, Yijun; and JIANG, Lingxiao. InferCode: Self-supervised learning of code representations by predicting subtrees. (2021). 2021 43rd International Conference on Software Engineering (ICSE): Virtual, May 25-28: Proceedings. 1186-1197.
Available at: https://ink.library.smu.edu.sg/sis_research/6716

Copyright Owner and License

Authors

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1109/ICSE43902.2021.00109

Included in

Software Engineering Commons