Research Collection School Of Computing and Information Systems

Automated identification of libraries from vulnerability data: can we do better?

Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

5-2022

Abstract

Software engineers depend heavily on software libraries and have to update their dependencies once vulnerabilities are found in them. Software Composition Analysis (SCA) helps developers identify vulnerable libraries used by an application. A key challenge is identifying the libraries related to a given vulnerability reported in the National Vulnerability Database (NVD), since the report may not explicitly indicate the affected libraries. Recently, researchers have tried to address the problem of identifying the libraries from an NVD report by treating it as an extreme multi-label learning (XML) problem, characterized by a large number of possible labels and severe data sparsity. The input is an NVD report, and the output is a set of relevant libraries. In this work, we evaluated multiple XML techniques. While previous work evaluated only one traditional XML technique, FastXML, we trained four other traditional XML models (DiSMEC, Parabel, Bonsai, ExtremeText) as well as two deep learning-based models (XML-CNN and LightXML). We compared both their effectiveness and the time cost of training the models and using them for predictions. We find that, other than DiSMEC and XML-CNN, recent XML models outperform the FastXML model by 3%–10% in terms of F1-scores on Top-k (k=1,2,3) predictions. Furthermore, we observe significant improvements in both the training and prediction time of these XML models, with the Bonsai and Parabel models achieving 627x and 589x faster training and 12x faster prediction than the FastXML baseline. We discuss the implications of our experimental results and highlight limitations for future work to address.
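
The abstract reports F1-scores on Top-k predictions without defining them here. As a rough illustration only, the sketch below computes one common per-report variant: precision and recall over the k highest-ranked labels, combined as their harmonic mean. The function name `f1_at_k` and the library names are hypothetical, and the exact metric definition used in the paper is an assumption, not something this record confirms.

```python
from typing import List, Set


def f1_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """F1 over the top-k ranked labels for a single report.

    Assumed definition (not confirmed by this record):
    precision@k = hits / k, recall@k = hits / number of true labels.
    """
    top_k = ranked[:k]
    hits = sum(1 for label in top_k if label in relevant)
    if hits == 0:
        return 0.0
    precision = hits / k
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical model output: libraries ranked by predicted relevance.
ranked_libraries = ["lib-a", "lib-b", "lib-c"]
# Hypothetical ground truth for the same report.
true_libraries = {"lib-a", "lib-c"}

scores = {k: f1_at_k(ranked_libraries, true_libraries, k) for k in (1, 2, 3)}
```

A dataset-level score under this assumption would average the per-report values for each k.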

Keywords

Multi-label classification, Machine learning, Vulnerability report

Discipline

Databases and Information Systems

Research Areas

Data Science and Engineering

Publication

Proceedings of the 30th International Conference on Program Comprehension, Virtual Event, 2022 May 16-17

ISBN

978-1-4503-7123-0

Identifier

10.1145/3377813.3381360

Publisher

Association for Computing Machinery

City or Country

New York

Citation

HARYONO, Stefanus A.; KANG, Hong Jin; SHARMA, Abhishek; SHARMA, Asankhaya; SANTOSA, Andrew E.; ANG, Ming Yi; and LO, David. Automated identification of libraries from vulnerability data: can we do better? (2022). Proceedings of the 30th International Conference on Program Comprehension, Virtual Event, 2022 May 16-17.
Available at: https://ink.library.smu.edu.sg/sis_research/7690

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1145/3377813.3381360

Included in

Databases and Information Systems Commons
