Publication Type

Journal Article

Version

publishedVersion

Publication Date

8-2025

Abstract

Regularly testing deep learning-powered systems on newly collected data is critical to ensure their reliability, robustness, and efficacy in real-world applications. This process is demanding due to the significant time and human effort required to label new data. Test selection methods alleviate this manual labor by labeling and evaluating only a subset of the data while still meeting testing criteria; however, we observe that such methods, despite their reported promising results, are evaluated only in simple settings, e.g., on the original test data. The question arises: are they always reliable? In this article, we explore when and to what extent test selection methods fail. First, we identify potential pitfalls of 11 selection methods based on how they are constructed. Second, we conduct a study to empirically confirm the existence of these pitfalls. Furthermore, we demonstrate how these pitfalls can break the reliability of the methods. Concretely, methods for fault detection suffer from data that are (1) correctly classified but uncertain or (2) misclassified but confident. Remarkably, the test relative coverage achieved by such methods drops by up to 86.85%. In addition, methods for performance estimation are sensitive to the choice of intermediate-layer output; their effectiveness can be even worse than that of random selection when an inappropriate layer is used.
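The fault-detection pitfalls summarized in the abstract can be illustrated with a minimal sketch. The snippet below is not the article's implementation; it assumes a generic Gini-impurity-style uncertainty score (one common choice in uncertainty-based test prioritization) and an illustrative uncertainty threshold of 0.5, purely to show how "correctly classified but uncertain" inputs consume the labeling budget while "misclassified but confident" inputs escape selection.

```python
import numpy as np

def gini_uncertainty(probs):
    """Gini-impurity-style uncertainty score over softmax outputs (N, C).
    Higher value = more uncertain prediction."""
    return 1.0 - np.sum(probs ** 2, axis=1)

def prioritize_by_uncertainty(probs, budget):
    """Pick the `budget` most uncertain inputs for labeling, as
    uncertainty-based fault-detection selection methods typically do."""
    scores = gini_uncertainty(probs)
    return np.argsort(-scores)[:budget]

def pitfall_categories(probs, labels, threshold=0.5):
    """Flag the two data categories identified as pitfalls:
    (1) correctly classified but uncertain -> selected, yet reveal no fault
    (2) misclassified but confident        -> real faults, yet never selected
    The threshold on the uncertainty score is an illustrative choice."""
    preds = probs.argmax(axis=1)
    unc = gini_uncertainty(probs)
    correct_but_uncertain = (preds == labels) & (unc >= threshold)
    wrong_but_confident = (preds != labels) & (unc < threshold)
    return correct_but_uncertain, wrong_but_confident

# Toy usage: 4 inputs, 3 classes, labeling budget of 2.
probs = np.array([
    [0.40, 0.35, 0.25],   # correct but uncertain -> wastes labeling budget
    [0.97, 0.02, 0.01],   # wrong but confident   -> fault that is never picked
    [0.95, 0.03, 0.02],   # correct and confident
    [0.34, 0.33, 0.33],   # wrong and uncertain   -> the case these methods target
])
labels = np.array([0, 1, 0, 2])

selected = prioritize_by_uncertainty(probs, budget=2)
cbu, wbc = pitfall_categories(probs, labels)
print("selected for labeling:", selected)          # inputs 3 and 0
print("correct-but-uncertain:", np.where(cbu)[0])  # pitfall (1): input 0
print("wrong-but-confident:  ", np.where(wbc)[0])  # pitfall (2): input 1
```

In this toy run, the selector spends half of its budget on input 0, which is uncertain but correctly classified, while the confidently misclassified input 1 is never chosen; this mirrors the coverage degradation the article reports for such data.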

Keywords

deep learning testing, test selection, empirical study, fault detection, performance estimation

Discipline

Digital Communications and Networking | Software Engineering

Research Areas

Intelligent Systems and Optimization

Areas of Excellence

Digital transformation

Publication

ACM Transactions on Software Engineering and Methodology

Volume

34

Issue

7

First Page

1

Last Page

26

ISSN

1049-331X

Identifier

10.1145/3715693

Publisher

Association for Computing Machinery (ACM)

Additional URL

https://doi.org/10.1145/3715693
