Publication Type
PhD Dissertation
Version
publishedVersion
Publication Date
10-2025
Abstract
The integration of Large Language Models (LLMs), particularly those tailored for programming tasks—referred to as code LLMs—has created novel opportunities to enhance developer productivity. These advanced models automate routine and repetitive coding tasks, such as code generation and debugging, and enable faster prototyping and more efficient problem-solving. Despite these remarkable advantages, the current generation of code LLMs exhibits notable limitations that impact their practical effectiveness in real-world software engineering scenarios. These models frequently produce code that is inefficient or suboptimal in runtime performance, demonstrate opaque reasoning processes, and struggle to adapt effectively to diverse developer contexts and specific requirements. Moreover, existing code LLMs often underperform in downstream software engineering tasks—such as code review, API analysis, or documentation generation—where alignment with human-centric practices, domain-specific knowledge, and structured workflows is essential. Such limitations diminish developers’ trust in these tools and hinder their full adoption and productive integration into software development workflows. To address these limitations, this dissertation proposes leveraging structured knowledge from the software engineering community—drawing on insights from developer forums, structured reasoning in coding discussions, and practical experiences shared on Q&A platforms—as guidance to substantially enhance the capabilities of code LLMs and enable their adaptation to downstream software engineering tasks.
Specifically, community knowledge is leveraged in two complementary ways: (1) to guide the training of code LLMs for improving generation quality, e.g., enhancing code accuracy, reasoning quality, and runtime efficiency; and (2) to enable the adaptation of code LLMs to downstream software engineering tasks by aligning model behavior with the practical requirements and collaborative conventions observed in real-world development workflows.
This dissertation makes five contributions: the first two enhance the code generation quality of code LLMs through guided training with structured community knowledge, while the latter three adapt code LLMs to downstream software engineering tasks.
i. The first work investigates structured code reasoning with community-derived logic. It introduces a structured reasoning framework that extracts and organizes developer discussions into step-by-step reasoning chains aligned with software development phases. These structured reasoning chains are then used to fine-tune a specialized code LLM that learns to generate both reasoning chains and code. The resulting model exhibits substantial improvements in reasoning quality and correctness, surpassing GPT-4 on challenging coding benchmarks.
ii. The second work explores reinforcement learning (RL) techniques to enhance the runtime efficiency of LLM-generated code. This approach leverages community-curated resources, e.g., publicly available code snippets, corresponding test cases, and real-world execution environments, to provide performance-based feedback. By incorporating these signals into the RL training loop, the model learns to generate code that is not only functionally correct but also optimized for runtime performance.
iii. The third work presents APIDocBooster, an extract-then-abstract summarization framework that enriches formal API documentation with practical insights derived from community-driven Q&A content. APIDocBooster incorporates domain-specific models to accurately classify content into documentation sections and employs LLMs to generate cohesive, informative, readable, and complementary summaries.
iv. The fourth work focuses on structured classification of API reviews. Recognizing that API reviews on developer forums are often fragmented and ambiguous, this approach introduces a transformer-based, aspect-focused classification framework to systematically organize API reviews, simplifying the extraction of actionable insights on key API characteristics from large volumes of reviews.
v. The fifth work presents TechSumBot, a query-focused summarization approach designed for technical Q&A environments. TechSumBot integrates contrastive learning and domain-specific semantic models to accurately identify and distill essential knowledge from multiple answers, enabling developers to efficiently navigate and assimilate information from extensive technical Q&A resources. Empirical evaluations demonstrate marked improvements over traditional summarization methods.
Degree Awarded
PhD in Computer Science
Discipline
Artificial Intelligence and Robotics | Programming Languages and Compilers
Supervisor(s)
LO, David
First Page
1
Last Page
186
Publisher
Singapore Management University
City or Country
Singapore
Citation
YANG, Chengran.
Harnessing SE community knowledge for developer-centric code intelligence. (2025). 1-186.
Available at: https://ink.library.smu.edu.sg/etd_coll/802
Copyright Owner and License
Author
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Included in
Artificial Intelligence and Robotics Commons, Programming Languages and Compilers Commons