Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
6-2020
Abstract
Software composition analysis depends on a database of open-source library vulnerabilities, curated by security researchers using various sources, such as bug tracking systems, commits, and mailing lists. We report the design and implementation of a machine learning system to help the curation by automatically predicting the vulnerability-relatedness of each data item. It supports a complete pipeline from data collection, model training and prediction, to the validation of new models before deployment. It is executed iteratively to generate better models as new input data become available. We use self-training to significantly and automatically increase the size of the training dataset, opportunistically maximizing the improvement in the models' quality at each iteration. We devised a new deployment stability metric to evaluate the quality of the new models before deployment into production, which helped to discover an error. We experimentally evaluate the improvement in the performance of the models in one iteration, with a maximum PR AUC improvement of 27.59%. Ours is the first such study across a variety of data sources. We discover that the addition of the features of the corresponding commits to the features of issues/pull requests improves the precision for the recall values that matter. We demonstrate the effectiveness of self-training alone, with a 10.50% PR AUC improvement, and we discover that there is no uniform ordering of word2vec parameter sensitivity across data sources.
Keywords
application security, open-source software, machine learning, classifiers ensemble, self-training
Discipline
Artificial Intelligence and Robotics | Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
MSR '20: Proceedings of the 17th IEEE/ACM International Conference on Mining Software Repositories, Virtual, Seoul, October 5-6
First Page
32
Last Page
42
ISBN
9781450379571
Identifier
10.1145/3379597.3387461
Publisher
ACM
City or Country
New York
Citation
CHEN, Yang; SANTOSA, Andrew E.; ANG, Ming Yi; SHARMA, Abhishek; SHARMA, Asankhaya; and LO, David.
A machine learning approach for vulnerability curation. (2020). MSR '20: Proceedings of the 17th IEEE/ACM International Conference on Mining Software Repositories, Virtual, Seoul, October 5-6. 32-42.
Available at: https://ink.library.smu.edu.sg/sis_research/5627
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1145/3379597.3387461