Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

9-2025

Abstract

Large language models (LLMs) have demonstrated impressive capabilities in code generation, achieving high scores on benchmarks such as HumanEval and MBPP. However, these benchmarks primarily assess functional correctness and neglect broader dimensions of code quality, including security, reliability, readability, and maintainability. In this work, we systematically evaluate the ability of LLMs to generate high-quality code across multiple dimensions using the PythonSecurityEval benchmark. We introduce an iterative static analysis-driven prompting algorithm that leverages Bandit and Pylint to identify and resolve code quality issues. Our experiments with GPT-4o show substantial improvements: security issues reduced from >40% to 13%, readability violations from >80% to 11%, and reliability warnings from >50% to 11% within ten iterations. These results demonstrate that LLMs, when guided by static analysis feedback, can significantly enhance code quality beyond functional correctness.
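
To illustrate the approach summarized above, the following is a minimal sketch of an iterative static-analysis-driven prompting loop that feeds Bandit and Pylint findings back to an LLM. The query_llm helper, the prompt wording, and the simple stopping rule are assumptions for illustration only; the paper's actual prompts, parsing, and evaluation setup are not reproduced here. The ten-iteration cap mirrors the budget reported in the abstract.

import json
import subprocess
import tempfile


def run_bandit(path: str) -> list[str]:
    # Bandit flags security issues; -f json yields machine-readable output.
    out = subprocess.run(["bandit", "-f", "json", path],
                         capture_output=True, text=True)
    report = json.loads(out.stdout or "{}")
    return [r["issue_text"] for r in report.get("results", [])]


def run_pylint(path: str) -> list[str]:
    # Pylint covers readability and reliability conventions, warnings, errors.
    out = subprocess.run(["pylint", "--output-format=json", path],
                         capture_output=True, text=True)
    return [m["message"] for m in json.loads(out.stdout or "[]")]


def query_llm(prompt: str) -> str:
    # Hypothetical stand-in for a GPT-4o API call.
    raise NotImplementedError("replace with a call to your LLM client")


def refine(task: str, max_iters: int = 10) -> str:
    # Generate an initial solution, then iteratively repair it using
    # static-analysis feedback until no issues remain or the budget runs out.
    code = query_llm(f"Write Python code for the following task:\n{task}")
    for _ in range(max_iters):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        issues = run_bandit(path) + run_pylint(path)
        if not issues:
            break  # both analyzers report a clean result
        feedback = "\n".join(f"- {issue}" for issue in issues)
        code = query_llm(
            "The following Python code has these static-analysis issues:\n"
            f"{feedback}\n\nRevise the code to resolve them:\n{code}"
        )
    return code
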

Keywords

Static analysis, automated program repair, large language models, code quality, code generation

Discipline

Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

2025 IEEE International Conference on Source Code Analysis and Manipulation (SCAM): Auckland, September 8-9: Proceedings

First Page

100

Last Page

109

ISBN

9798331596989

Identifier

10.1109/SCAM67354.2025.00017

Publisher

IEEE Computer Society

City or Country

Los Alamitos, CA

Additional URL

https://doi.org/10.1109/SCAM67354.2025.00017
