Publication Type
Journal Article
Version
acceptedVersion
Publication Date
1-2024
Abstract
The costly human effort required to prepare the training data of machine learning (ML) models hinders their practical development and usage in software engineering (ML4Code), especially for those with limited budgets. Therefore, efficiently training models of code with less human effort has become an emergent problem. Active learning is a technique that addresses this issue by allowing developers to train a model on reduced data while still achieving the desired performance; it has been well studied in the computer vision and natural language processing domains. Unfortunately, no existing work explores the effectiveness of active learning for code models. In this paper, we bridge this gap by building the first benchmark to study this critical problem - active code learning. Specifically, we collect 11 acquisition functions (the functions used for data selection in active learning) from existing works and adapt them for code-related tasks. Then, we conduct an empirical study to check whether these acquisition functions maintain their performance on code data. The results demonstrate that feature selection highly affects active learning and that using output vectors to select data is the best choice. For the code summarization task, active code learning is ineffective, producing models with over a 29.64% performance gap compared to the expected performance. Furthermore, we explore future directions of active code learning with an exploratory study. We propose replacing distance calculation methods with evaluation metrics and find a correlation between these evaluation-based distance methods and the performance of code models.
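To make the abstract's finding concrete, below is a minimal sketch (not taken from the paper) of one common output-vector-based acquisition function, least-confidence sampling: given the model's softmax outputs on the unlabeled pool, it selects the samples the model is least confident about for labeling. The function name and the `budget` parameter are illustrative assumptions, not the paper's API.

```python
import numpy as np

def least_confidence_select(output_probs: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` least-confident samples.

    output_probs: (n_samples, n_classes) softmax outputs of the code model
    on the unlabeled pool.
    """
    confidence = output_probs.max(axis=1)   # top-class probability per sample
    return np.argsort(confidence)[:budget]  # lowest confidence selected first

# Hypothetical usage:
#   probs = model_outputs_on_pool          # shape (n_samples, n_classes)
#   chosen = least_confidence_select(probs, budget=100)
```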
Keywords
Codes, Data Models, Task Analysis, Training, Feature Extraction, Training Data, Labeling, Active Learning, Machine Learning For Code, Benchmark, Empirical Analysis
Discipline
Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
IEEE Transactions on Software Engineering
First Page
1
Last Page
17
ISSN
0098-5589
Identifier
10.1109/TSE.2024.3376964
Publisher
Institute of Electrical and Electronics Engineers
Citation
HU, Qiang; GUO, Yuejun; XIE, Xiaofei; CORDY, Maxime; MA, Lei; PAPADAKIS, Mike; and LE TRAON, Yves.
Active code learning: Benchmarking sample-efficient training of code models. (2024). IEEE Transactions on Software Engineering. 1-17.
Available at: https://ink.library.smu.edu.sg/sis_research/8695
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1109/TSE.2024.3376964