Publication Type
PhD Dissertation
Version
publishedVersion
Publication Date
4-2026
Abstract
This dissertation investigates how to deploy Large Language Models (LLMs) effectively in enterprise settings, where accuracy, reliability, cost, privacy, and operational constraints often matter more than benchmark performance alone. Drawing on seventeen peer-reviewed publications (eleven published and six accepted for publication), the work develops and validates optimization strategies across three connected themes: retrieval-augmented generation (RAG), agentic AI for workflow automation, and deployment guidelines for real-world enterprise environments.
First, we study RAG optimization through systematic evaluation of open and proprietary models, highlighting conditions under which efficient open-weight models can match or exceed proprietary alternatives. To address a pervasive failure mode in open-source LLMs—output repetition—we introduce Repetition-Aware Performance (RAP), a metric that integrates repetition penalties into task performance. Across models ranging from 2B to 70B parameters, RAP-guided tuning reduces repetition by up to 93.1% with only 3.7% performance degradation, enabling more dependable RAG outputs in production.
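To make the idea of folding a repetition penalty into a task score concrete, the following is a minimal sketch. The abstract does not define RAP, so both functions are hypothetical: `repeated_ngram_ratio` uses a generic repeated n-gram ratio as the repetition measure, and `penalized_score` uses an assumed multiplicative penalty with an assumed `weight` parameter. Neither should be read as the dissertation's formula.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an earlier n-gram (0.0 = no repetition)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def penalized_score(task_score: float, repetition: float, weight: float = 1.0) -> float:
    """Hypothetical combination: shrink the task score as repetition grows.

    This is an illustrative assumption, not the RAP definition.
    """
    penalty = min(1.0, weight * repetition)
    return task_score * (1.0 - penalty)
```

Under this assumed form, an output with no repeated n-grams keeps its task score unchanged, while a heavily repetitive output is scored lower even when its content is otherwise correct.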
Second, we develop multi-agent architectures for enterprise automation and propose reliability-centric evaluation for tool-using systems. An invoice reconciliation system achieves 99.9% task success with edge-deployed models at 523 Joules per task, while a structured diagnostic taxonomy identifies tool initialization as a dominant source of agent failures in smaller models. For scalable automated assessment, we introduce an LLM-as-a-Judge framework and the reliability metric Evaluation Completion Rate (ECR@1), enabling principled accuracy-reliability-cost trade-offs across a 175× pricing range; GPT-4o Mini emerges as production-optimal at 78× cost reduction versus premium models. A cooperative multi-agent system for acceptance test generation (CMAS4G2) demonstrates that open-weight models approach proprietary performance (87.6% vs. 90.4% success), with reasoning configuration profoundly impacting coordination. For payment processing, we design the first LLM-based agentic payment framework (HMASP), achieving 97.0% task success, and introduce Agentic Success Rate (ASR)—a trajectory-level workflow fidelity metric grounded in process-mining conformance checking. Across 18 LLMs and 90,000 task instances, ASR exposes systematic procedural deviations invisible to conventional metrics: most models bypass confirmation checkpoints during payment checkout despite near-perfect task success, with regulatory implications for PCI-DSS compliance. Ongoing work is evolving the research prototype into ConvPayMAS—an agentic payment system implementing Google’s AP2 three-mandate cryptographic verification with patent-pending security features—upon which an enterprise agentic evaluation and optimization framework is being designed.
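The gap between task success and workflow fidelity described above can be illustrated with a small sketch. It assumes each run is recorded as an ordered list of step names plus a success flag, and it checks only that a set of required steps appears in order. The step names, the `summarize` helper, and the ordered-subsequence rule are illustrative assumptions; the dissertation's ASR definition and its conformance-checking procedure are not given in this abstract.

```python
from typing import Sequence, Tuple


def follows_required_order(trajectory: Sequence[str], required: Sequence[str]) -> bool:
    """True if every required step occurs in the trajectory, in the given order."""
    steps = iter(trajectory)
    # `step in steps` advances the iterator, so order is enforced.
    return all(step in steps for step in required)


def summarize(
    runs: Sequence[Tuple[Sequence[str], bool]],
    required: Sequence[str],
) -> Tuple[float, float]:
    """Return (task success rate, rate of runs that succeed AND follow the workflow)."""
    total = len(runs)
    if total == 0:
        return 0.0, 0.0
    succeeded = sum(1 for _, ok in runs if ok)
    conforming = sum(
        1 for traj, ok in runs if ok and follows_required_order(traj, required)
    )
    return succeeded / total, conforming / total


# Hypothetical checkout workflow with a mandatory confirmation checkpoint.
REQUIRED = ["add_to_cart", "confirm_order", "pay"]
RUNS = [
    (["add_to_cart", "confirm_order", "pay"], True),   # succeeds, conforms
    (["add_to_cart", "pay"], True),                    # succeeds, skips confirmation
    (["add_to_cart", "confirm_order"], False),         # fails before payment
    (["add_to_cart", "pay", "confirm_order"], True),   # succeeds, wrong order
]
success_rate, conforming_rate = summarize(RUNS, REQUIRED)
```

In this toy data, three of four runs complete the payment, but only one of them passes the confirmation checkpoint in the right place, which is the kind of deviation a final-outcome metric would not reveal.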
Third, we examine deployment strategies across heterogeneous hardware and tasks. Benchmarking across six hardware platforms shows that platform choices (e.g., WSL vs. native Windows) can yield up to 21× inference speedups, while medium-sized models (7B–32B) achieve optimal performance-efficiency trade-offs. A central finding from 504 configurations across seven model families is that reasoning effectiveness is strongly task-dependent: reasoning degrades simple binary sentiment classification (up to −19.9 F1 pp, 100% failure rate) yet substantially improves complex emotion recognition (up to +16.0 F1 pp, only 14% failure rate). An entity-matching evaluation across 46 model configurations and 8 benchmark datasets reveals that reasoning effects are also family-dependent (low reasoning is optimal for GPT-5, high reasoning is critical for open-weight models, and Claude models exhibit extreme asymmetric sensitivity), while open-weight models rival proprietary APIs (GPT-OSS:120b outperforming GPT-4.1 at 20× lower cost). Cross-domain studies in machine translation, logical reasoning, and sports analytics further validate the generalizability of the proposed metrics and deployment principles.
Overall, this dissertation demonstrates that open-source LLMs, when optimized and evaluated with reliability-aware metrics, can support privacy-preserving and cost-effective enterprise deployment. The proposed metrics, architectural patterns, and empirically grounded guidelines provide a practical foundation for building robust generative AI systems in production settings.
Keywords
Large Language Models, Retrieval-Augmented Generation, Multi-Agent Systems, Agentic Commerce, Reasoning, Edge Deployment, Enterprise AI, Model Quantization, Few-Shot Learning, Fine-Tuning, Sentiment Analysis, Entity Matching, Tool Invocation, Evaluation Metrics, LoRA, Privacy-Preserving AI, Workflow Automation
Degree Awarded
Doctor of Engineering
Discipline
Artificial Intelligence and Robotics | Computer Sciences
Supervisor(s)
NGO, Chong Wah; WANG, Zhaoxia
First Page
1
Last Page
244
Publisher
Singapore Management University
City or Country
Singapore
Citation
HUANG, Donghao. Generative AI in enterprises: optimizing applications with large language models. (2026). 1-244.
Available at: https://ink.library.smu.edu.sg/etd_coll/869
Copyright Owner and License
Author
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.