Research Collection School Of Computing and Information Systems

A machine learning approach for vulnerability curation

Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

6-2020

Abstract

Software composition analysis depends on a database of open-source library vulnerabilities, curated by security researchers using various sources, such as bug tracking systems, commits, and mailing lists. We report the design and implementation of a machine learning system that helps the curation by automatically predicting the vulnerability-relatedness of each data item. It supports a complete pipeline from data collection, model training, and prediction to the validation of new models before deployment. It is executed iteratively to generate better models as new input data become available. We use self-training to significantly and automatically increase the size of the training dataset, opportunistically maximizing the improvement in the models' quality at each iteration. We devised a new deployment stability metric to evaluate the quality of new models before deployment into production, which helped to discover an error. We experimentally evaluate the improvement in the performance of the models in one iteration, with a maximum PR AUC improvement of 27.59%. Ours is the first such study across a variety of data sources. We discover that adding the features of the corresponding commits to the features of issues/pull requests improves the precision for the recall values that matter. We demonstrate the effectiveness of self-training alone, with a 10.50% PR AUC improvement, and we discover that there is no uniform ordering of word2vec parameter sensitivity across data sources.
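
The self-training idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' system: it assumes a toy one-dimensional nearest-centroid classifier and a common confidence-threshold form of self-training, and the names `fit_centroids`, `predict_with_confidence`, `self_train`, and the `threshold` value are hypothetical choices made only for illustration.

```python
from statistics import mean


def fit_centroids(points, labels):
    """Toy 1-D classifier (an assumption): one mean feature value per class."""
    return {c: mean(p for p, l in zip(points, labels) if l == c)
            for c in set(labels)}


def predict_with_confidence(centroids, x):
    """Predict the nearest centroid's class (0 or 1).

    Confidence is the share of the total distance that belongs to the
    farther centroid, so it ranges from 0.5 (tie) to 1.0 (on a centroid).
    """
    d0 = abs(x - centroids[0])
    d1 = abs(x - centroids[1])
    total = d0 + d1
    if total == 0:
        return 0, 0.5
    label = 0 if d0 <= d1 else 1
    return label, max(d0, d1) / total


def self_train(points, labels, unlabeled, threshold=0.8, max_rounds=10):
    """Grow the training set with confidently pseudo-labeled items."""
    points, labels, pool = list(points), list(labels), list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(points, labels)
        confident, remaining = [], []
        for x in pool:
            label, conf = predict_with_confidence(centroids, x)
            (confident if conf >= threshold else remaining).append((x, label))
        if not confident:
            break  # nothing new to learn from; stop early
        for x, label in confident:
            points.append(x)
            labels.append(label)
        pool = [x for x, _ in remaining]
    return fit_centroids(points, labels), points, labels, pool


if __name__ == "__main__":
    model, pts, lbls, left = self_train(
        [0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1], [0.5, 9.5, 5.0]
    )
    print(model, len(pts), left)
```

In this sketch, items the current model labels with high confidence are moved into the training set and the model is refit, while ambiguous items stay unlabeled for a later round or for manual review.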

Keywords

application security, open-source software, machine learning, classifiers ensemble, self-training

Discipline

Artificial Intelligence and Robotics | Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

MSR '20: Proceedings of the 17th IEEE/ACM International Conference on Mining Software Repositories, Virtual, Seoul, October 5-6

First Page

32

Last Page

42

ISBN

9781450379571

Identifier

10.1145/3379597.3387461

Publisher

ACM

City or Country

New York

Citation

CHEN, Yang; SANTOSA, Andrew E.; ANG, Ming Yi; SHARMA, Abhishek; SHARMA, Asankhaya; and LO, David. A machine learning approach for vulnerability curation. (2020). MSR '20: Proceedings of the 17th IEEE/ACM International Conference on Mining Software Repositories, Virtual, Seoul, October 5-6. 32-42.
Available at: https://ink.library.smu.edu.sg/sis_research/5627

Copyright Owner and License

Authors

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1145/3379597.3387461

Included in

Artificial Intelligence and Robotics Commons, Software Engineering Commons
