Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

6-2020

Abstract

Software composition analysis depends on database of open-source library vulerabilities, curated by security researchers using various sources, such as bug tracking systems, commits, and mailing lists. We report the design and implementation of a machine learning system to help the curation by by automatically predicting the vulnerability-relatedness of each data item. It supports a complete pipeline from data collection, model training and prediction, to the validation of new models before deployment. It is executed iteratively to generate better models as new input data become available. We use self-training to significantly and automatically increase the size of the training dataset, opportunistically maximizing the improvement in the models' quality at each iteration. We devised new deployment stability metric to evaluate the quality of the new models before deployment into production, which helped to discover an error. We experimentally evaluate the improvement in the performance of the models in one iteration, with 27.59% maximum PR AUC improvements. Ours is the first of such study across a variety of data sources. We discover that the addition of the features of the corresponding commits to the features of issues/pull requests improve the precision for the recall values that matter. We demonstrate the effectiveness of self-training alone, with 10.50% PR AUC improvement, and we discover that there is no uniform ordering of word2vec parameters sensitivity across data sources.

Keywords

application security, open-source software, machine learning, classifiers ensemble, self-training

Discipline

Artificial Intelligence and Robotics | Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

MSR '20: Proceedings of the 17th IEEE/ACM International Conference on Mining Software Repositories, Virtual, Seoul, October 5-6

First Page

32

Last Page

42

ISBN

9781450379571

Identifier

10.1145/3379597.3387461

Publisher

ACM

City or Country

New York

Copyright Owner and License

Authors

Additional URL

https://doi.org/10.1145/3379597.3387461

Share

COinS