Publication Type
Journal Article
Version
publishedVersion
Publication Date
8-2025
Abstract
Regularly testing deep learning-powered systems on newly collected data is critical to ensure their reliability, robustness, and efficacy in real-world applications. This process is demanding due to the significant time and human effort required for labeling new data. Test selection methods alleviate this manual labor by labeling and evaluating only a subset of the data while still meeting testing criteria. However, we observe that such methods, despite reporting promising results, are evaluated only in simple settings, e.g., on the original test data. The question arises: are they always reliable? In this article, we explore when and to what extent test selection methods fail. First, we identify potential pitfalls of 11 selection methods based on their construction. Second, we conduct a study to empirically confirm the existence of these pitfalls. Furthermore, we demonstrate how these pitfalls can break the reliability of the methods. Concretely, methods for fault detection suffer from data that are (1) correctly classified but uncertain or (2) misclassified but confident. Remarkably, the test relative coverage achieved by such methods drops by up to 86.85%. In addition, methods for performance estimation are sensitive to the choice of intermediate-layer output. The effectiveness of such methods can be even worse than random selection when an inappropriate layer is used.
Keywords
deep learning testing, test selection, empirical study, fault detection, performance estimation
Discipline
Digital Communications and Networking | Software Engineering
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
ACM Transactions on Software Engineering and Methodology
Volume
34
Issue
7
First Page
1
Last Page
26
ISSN
1049-331X
Identifier
10.1145/3715693
Publisher
Association for Computing Machinery (ACM)
Citation
HU, Qiang; GUO, Yuejun; XIE, Xiaofei; CORDY, Maxime; MA, Wei; PAPADAKIS, Mike; MA, Lei; and LE TRAON, Yves.
Assessing the robustness of test selection methods for deep neural networks. (2025). ACM Transactions on Software Engineering and Methodology. 34, (7), 1-26.
Available at: https://ink.library.smu.edu.sg/sis_research/10332
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1145/3715693