Publication Type
Journal Article
Version
acceptedVersion
Publication Date
1-2026
Abstract
The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks from code generation to program repair, producing a massive volume of software artifacts. This surge in automated creation has exposed a critical bottleneck: the lack of scalable and reliable methods to evaluate the quality of these outputs. Human evaluation, while effective, is costly and time-consuming. Traditional automated metrics like BLEU rely on high-quality references and struggle to capture nuanced aspects of software quality, such as readability and usefulness. In response, the LLM-as-a-Judge paradigm has emerged. This approach leverages the advanced reasoning and coding capabilities of LLMs themselves to perform automated evaluations, offering a compelling path toward achieving both the nuance of human insight and the scalability of automated systems. Nevertheless, LLM-as-a-Judge research in the SE community is still in its early stages, with many breakthroughs needed. This forward-looking SE 2030 paper aims to steer the research community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts, while also sharing potential research paths to achieve this goal. We provide a literature review of existing SE studies on LLM-as-a-Judge and envision these frameworks as reliable, robust, and scalable human surrogates capable of evaluating software artifacts with consistent, multi-faceted assessments by 2030 and beyond. To realize this vision, we analyze the limitations of current studies, identify key research gaps, and outline a detailed roadmap to guide future developments of LLM-as-a-Judge in software engineering. While not intended to be a definitive guide, our work aims to foster further research and adoption of LLM-as-a-Judge frameworks within the SE community, ultimately improving the effectiveness and scalability of software artifact evaluation methods.
Keywords
Large Language Models, Software Engineering, LLM-as-a-Judge, Research Roadmap
Discipline
Artificial Intelligence and Robotics | Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
ACM Transactions on Software Engineering and Methodology
First Page
1
Last Page
28
ISSN
1049-331X
Identifier
10.1145/3797276
Publisher
Association for Computing Machinery (ACM)
Citation
HE, Junda; SHI, Jieke; ZHUO, Terry Yue; TREUDE, Christoph; SUN, Jiamou; XING, Zhenchang; DU, Xiaoning; and LO, David.
LLM-as-a-Judge for software engineering: Literature review, vision, and the road ahead. (2026). ACM Transactions on Software Engineering and Methodology, 1-28.
Available at: https://ink.library.smu.edu.sg/sis_research/11079
Copyright Owner and License
Authors
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1145/3797276