Research Collection School Of Computing and Information Systems

Data quality matters: A case study on data label correctness for security bug report prediction

Publication Type

Journal Article

Version

acceptedVersion

Publication Date

7-2022

Abstract

In mining software repositories research, we need to label a large amount of data to construct a predictive model. The correctness of the labels substantially affects the performance of a model. However, few studies have investigated the impact of mislabeled instances on a predictive model. To bridge this gap, in this article, we perform a case study on security bug report (SBR) prediction. We found that five publicly available datasets for SBR prediction contain many mislabeled instances, which lead to the poor performance of SBR prediction models in recent studies (e.g., the work of Peters et al. and Shu et al.). Furthermore, these mislabeled instances might mislead the research direction of SBR prediction. In this article, we first improve the label correctness of these five datasets by manually analyzing each bug report, and we find 749 SBRs that were originally mislabeled as Non-SBRs (NSBRs). We then evaluate the impact of dataset label correctness by comparing the performance of the classification models on both the noisy (i.e., before our correction) and the clean (i.e., after our correction) datasets. The results show that the cleaned datasets improve the performance of classification models. The performance of the approaches proposed by Peters et al. and Shu et al. on the clean datasets is much better than on the noisy datasets. Furthermore, with the clean datasets, simple text classification models could significantly outperform the security keywords-matrix-based approaches applied by Peters et al. and Shu et al.

Keywords

Computer bugs, Noise measurement, Predictive models, Security, Chromium, Tuning, Data models, Security bug report prediction, data quality, label correctness

Discipline

Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

IEEE Transactions on Software Engineering

Volume

48

Issue

7

First Page

2541

Last Page

2556

ISSN

0098-5589

Identifier

10.1109/TSE.2021.3063727

Publisher

Institute of Electrical and Electronics Engineers

Citation

WU, Xiaoxue; ZHENG, Wei; XIA, Xin; and LO, David. Data quality matters: A case study on data label correctness for security bug report prediction. (2022). IEEE Transactions on Software Engineering. 48, (7), 2541-2556.
Available at: https://ink.library.smu.edu.sg/sis_research/7436

Copyright Owner and License

Authors

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1109/TSE.2021.3063727

Included in

Software Engineering Commons
