Publication Type

Journal Article

Version

publishedVersion

Publication Date

9-2020

Abstract

The interpretation of defect models heavily relies on software metrics that are used to construct them. Prior work often uses feature selection techniques to remove metrics that are correlated and irrelevant in order to improve model performance. Yet, conclusions that are derived from defect models may be inconsistent if the selected metrics are inconsistent and correlated. In this paper, we systematically investigate 12 automated feature selection techniques with respect to the consistency, correlation, performance, computational cost, and the impact on the interpretation dimensions. Through an empirical investigation of 14 publicly-available defect datasets, we find that (1) 94–100% of the selected metrics are inconsistent among the studied techniques; (2) 37–90% of the selected metrics are inconsistent among training samples; (3) 0–68% of the selected metrics are inconsistent when the feature selection techniques are applied repeatedly; (4) 5–100% of the produced subsets of metrics contain highly correlated metrics; and (5) while the most important metrics are inconsistent among correlation threshold values, such inconsistent most important metrics are highly-correlated with the Spearman correlation of 0.85–1. Since we find that the subsets of metrics produced by the commonly-used feature selection techniques (except for AutoSpearman) are often inconsistent and correlated, these techniques should be avoided when interpreting defect models. In addition to introducing AutoSpearman which mitigates correlated metrics better than commonly-used feature selection techniques, this paper opens up new research avenues in the automated selection of features for defect models to optimise for interpretability as well as performance.

Keywords

Software analytics, Defect prediction, Model interpretation, Feature selection

Discipline

Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

Empirical Software Engineering

Volume

Issue

First Page

3590

Last Page

3638

ISSN

1382-3256

Identifier

10.1007/s10664-020-09848-1

Publisher

Springer

Citation

JIARPAKDEE, Jirayus; TANTITHAMTHAVORN, Chakkrit; and TREUDE, Christoph. The impact of automated feature selection techniques on the interpretation of defect models. (2020). Empirical Software Engineering. 25, (5), 3590-3638.
Available at: https://ink.library.smu.edu.sg/sis_research/8796

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1007/s10664-020-09848-1

Download

Included in

Software Engineering Commons

COinS

Research Collection School Of Computing and Information Systems

The impact of automated feature selection techniques on the interpretation of defect models

Publication Type

Version

Publication Date

Abstract

Keywords

Discipline

Research Areas

Publication

Volume

Issue

First Page

Last Page

ISSN

Identifier

Publisher

Citation

Creative Commons License

Additional URL

Included in

Search

Links

Browse

Links

Research Collection School Of Computing and Information Systems

The impact of automated feature selection techniques on the interpretation of defect models

Author

Publication Type

Version

Publication Date

Abstract

Keywords

Discipline

Research Areas

Publication

Volume

Issue

First Page

Last Page

ISSN

Identifier

Publisher

Citation

Creative Commons License

Additional URL

Included in

Share

Search

Links

Browse

Links