Prompt engineering in LLMs for automated unit test generation: A large-scale study
Publication Type
Journal Article
Publication Date
3-2026
Abstract
Unit testing is essential for software reliability, yet manual test creation is time-consuming and often neglected. Search-based software testing improves efficiency, but it produces tests with poor readability and maintainability. LLMs show promise for test generation, yet existing research lacks comprehensive evaluation across execution-driven assessment, reasoning-based prompting, and real-world testing scenarios. This study presents the first large-scale empirical evaluation of LLM-generated unit tests at the full class level, systematically analyzing four state-of-the-art models (GPT-3.5, GPT-4, Mistral 7B, and Mixtral 8x7B) against EvoSuite across 216,300 generated test cases targeting Defects4J, SF110, and CMD (a dataset designed to mitigate LLM training-data leakage). We evaluate five prompting techniques: Zero-Shot Learning (ZSL), Few-Shot Learning (FSL), Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Guided Tree-of-Thought (GToT), assessing syntactic correctness, compilability, hallucination-driven failures, readability, code coverage metrics, and test smells. Reasoning-based prompting, particularly GToT, significantly enhances test reliability, compilability, and structural adherence in general-purpose LLMs. However, hallucination-driven failures remain a persistent challenge, manifesting as references to non-existent symbols, incorrect API calls, and fabricated dependencies, and resulting in compilation failure rates of up to 86%. Moreover, test smell analysis reveals that while LLM-generated tests are generally more readable than those produced by traditional tools, they still suffer from recurring design issues, such as Magic Number Tests and Assertion Roulette, that hinder maintainability. Overall, our findings indicate that LLMs can serve as effective assistive tools for generating readable and maintainable test suites, but hybrid approaches that combine LLM-based generation with automated validation and search-based refinement are required to achieve reliable, production-ready test generation.
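As a minimal illustration (our own sketch, not the paper's actual prompts), the five strategies named in the abstract could be framed as templates for class-level JUnit test generation; all template wording and names below are assumptions for demonstration only:

    # Hypothetical Python sketch: prompt templates approximating the five
    # strategies the study evaluates (ZSL, FSL, CoT, ToT, GToT). The paper's
    # exact prompts are not reproduced here; all wording is illustrative.
    TEMPLATES = {
        # Zero-shot: the class under test only, no examples or reasoning cues.
        "ZSL": (
            "Generate a complete JUnit test class for the Java class below.\n"
            "{source}"
        ),
        # Few-shot: one worked class/tests pair precedes the target class.
        "FSL": (
            "Example Java class:\n{example_class}\n"
            "Example JUnit tests for it:\n{example_tests}\n"
            "Now generate a JUnit test class for:\n{source}"
        ),
        # Chain-of-thought: ask for stepwise reasoning before the tests.
        "CoT": (
            "Reason step by step: list the public methods of the class below, "
            "derive representative inputs and expected outputs for each, then "
            "write a JUnit test class covering them.\n{source}"
        ),
        # Tree-of-thought: explore alternative plans, then expand the best one.
        "ToT": (
            "Propose several alternative testing strategies for the class "
            "below, assess their likely coverage, and expand the most "
            "promising one into a JUnit test class.\n{source}"
        ),
        # Guided tree-of-thought: the branching structure is prescribed.
        "GToT": (
            "Follow these guided steps: (1) enumerate public methods and "
            "their contracts, (2) branch into candidate test plans per "
            "method, (3) prune plans with duplicate coverage, (4) emit one "
            "compilable JUnit test class implementing the surviving plans.\n"
            "{source}"
        ),
    }

    def build_prompt(strategy: str, source: str, **examples: str) -> str:
        """Fill the chosen template; extra keyword fields (e.g. the FSL
        examples) are ignored by templates that do not use them."""
        return TEMPLATES[strategy].format(source=source, **examples)

    # Usage: build_prompt("GToT", java_source) or, for few-shot,
    # build_prompt("FSL", java_source, example_class=..., example_tests=...).

Under this framing, GToT differs from plain ToT only in that the branching steps are prescribed rather than left to the model, which is consistent with the abstract's finding that guided reasoning improves structural adherence and compilability.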
Keywords
Automatic Test Generation, Unit Tests, Large Language Models, Prompt Engineering, Empirical Evaluation
Discipline
Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
Empirical Software Engineering
Volume
31
Issue
4
First Page
1
Last Page
58
ISSN
1382-3256
Identifier
10.1007/s10664-026-10840-4
Publisher
Springer
Citation
Ouedraogo, Wendkuuni C.; Kabore, Abdoul Kader; Li, Yinghua; Tian, Haoye; Koyuncu, Anil; Klein, Jacques; Lo, David; and Bissyande, Tegawende F.
Prompt engineering in LLMs for automated unit test generation: A large-scale study. (2026). Empirical Software Engineering. 31, (4), 1-58.
Available at: https://ink.library.smu.edu.sg/sis_research/11085
Additional URL
https://doi.org/10.1007/s10664-026-10840-4