Research Collection School Of Computing and Information Systems

Less is more: On the importance of data quality for unit test generation

Publication Type

Journal Article

Version

publishedVersion

Publication Date

6-2025

Abstract

Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of datasets used for test generation. To bridge this gap, we systematically examine the impact of noise on the performance of learning-based test generation models. We first apply the open card sorting method to analyze the most popular and largest test generation dataset, Methods2Test, and categorize eight distinct types of noise. We then conduct detailed interviews with 17 domain experts to validate the noise taxonomy and assess its importance, reasonableness, and correctness. Next, we propose CleanTest, an automated noise-cleaning framework designed to improve the quality of test generation datasets. CleanTest comprises three filters: a rule-based syntax filter, a rule-based relevance filter, and a model-based coverage filter. To evaluate its effectiveness, we apply CleanTest to two widely used test generation datasets, Methods2Test and Atlas. Our findings indicate that 43.52% and 29.65% of the two datasets, respectively, contain noise, highlighting its prevalence. Finally, we conduct comparative experiments using four LLMs (i.e., CodeBERT, AthenaTest, StarCoder, and CodeLlama7B) to assess the impact of noise on test generation performance. The results show that filtering noise positively influences the test generation ability of the models. Fine-tuning the four LLMs on the filtered Methods2Test dataset improves their branch coverage on the Defects4J benchmark by 67% on average. For the Atlas dataset, the four LLMs improve branch coverage by 39%. Additionally, filtering noise improves bug detection performance, resulting in a 21.42% increase in bugs detected by the generated tests.

Keywords

Unit Test Generation, Large Language Models, Dataset Quality

Discipline

Databases and Information Systems | Software Engineering | Theory and Algorithms

Research Areas

Software and Cyber-Physical Systems

Publication

Proceedings of the ACM on Software Engineering

Volume

2

Issue

FSE

First Page

1293

Last Page

1316

Identifier

10.1145/3715778

Publisher

Association for Computing Machinery

Citation

ZHANG, Junwei; HU, Xing; GAO, Shan; XIA, Xin; LO, David; and LI, Shanping. Less is more: On the importance of data quality for unit test generation. (2025). Proceedings of the ACM on Software Engineering. 2, (FSE), 1293-1316.
Available at: https://ink.library.smu.edu.sg/sis_research/10955

Copyright Owner and License

Authors-CC-BY

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1145/3715778

Included in

Databases and Information Systems Commons, Software Engineering Commons, Theory and Algorithms Commons
