Research Collection School Of Computing and Information Systems

Assessing the generalizability of code2vec token embeddings

Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

11-2019

Abstract

Many Natural Language Processing (NLP) tasks, such as sentiment analysis or syntactic parsing, have benefited from the development of word embedding models. In particular, regardless of the training algorithm, the learned embeddings have often been shown to generalize to different NLP tasks. In contrast, despite recent momentum on word embeddings for source code, the literature lacks evidence of their generalizability beyond the example task they were trained for. In this experience paper, we identify 3 potential downstream tasks to which source code token embedding models can be applied, namely code comment generation, code authorship identification, and code clone detection. We empirically assess a recently proposed code token embedding model, namely code2vec's token embeddings. Code2vec was trained on the task of predicting method names, and while there is potential for using the vectors it learns on other tasks, this has not been explored in the literature. We therefore fill this gap by focusing on its generalizability for the tasks we have identified. Ultimately, we show that source code token embeddings cannot be readily leveraged for the downstream tasks. Our experiments even show that our attempts to use them do not result in any improvement over less sophisticated methods. We call for more research into effective and general use of code embeddings.

Keywords

Code Embeddings, Distributed Representations, Big Code

Discipline

Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

2019 34th ACM/IEEE International Conference on Automated Software Engineering: San Diego, November 11-15: Proceedings

First Page

1

Last Page

12

ISBN

9781728125084

Identifier

10.1109/ASE.2019.00011

Publisher

IEEE

City or Country

Piscataway, NJ

Citation

KANG, Hong Jin; BISSYANDE, Tegawende F.; and LO, David. Assessing the generalizability of code2vec token embeddings. (2019). 2019 34th ACM/IEEE International Conference on Automated Software Engineering: San Diego, November 11-15: Proceedings. 1-12.
Available at: https://ink.library.smu.edu.sg/sis_research/4493

Copyright Owner and License

Authors

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Additional URL

https://doi.org/10.1109/ASE.2019.00011

Included in

Software Engineering Commons
