Publication Type

PhD Dissertation

Version

publishedVersion

Publication Date

4-2026

Abstract

This dissertation investigates how to deploy Large Language Models (LLMs) effectively in enterprise settings, where accuracy, reliability, cost, privacy, and operational constraints often matter more than benchmark performance alone. Drawing on seventeen peer-reviewed publications (eleven published and six accepted for publication), the work develops and validates optimization strategies across three connected themes: retrieval-augmented generation (RAG), agentic AI for workflow automation, and deployment guidelines for real-world enterprise environments.

First, we study RAG optimization through systematic evaluation of open and proprietary models, highlighting conditions under which efficient open-weight models can match or exceed proprietary alternatives. To address a pervasive failure mode in open-source LLMs—output repetition—we introduce Repetition-Aware Performance (RAP), a metric that integrates repetition penalties into task performance. Across models ranging from 2B to 70B parameters, RAP-guided tuning reduces repetition by up to 93.1% with only 3.7% performance degradation, enabling more dependable RAG outputs in production.
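As a rough illustration of what integrating a repetition penalty into task performance could look like, the sketch below discounts a task score by the share of repeated word n-grams in the output. This is an assumption made for illustration, not the RAP formula defined in the dissertation: the n-gram repetition measure, the multiplicative discount, and the names `repetition_rate` and `repetition_aware_score` are all hypothetical.

```python
from collections import Counter


def repetition_rate(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram (0.0 means none)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def repetition_aware_score(task_score: float, text: str, n: int = 3) -> float:
    """Hypothetical combination: scale the task score down by the repetition rate."""
    return task_score * (1.0 - repetition_rate(text, n))
```

Under this assumed form, an output with no repeated n-grams keeps its task score unchanged, while a looping output is penalized in proportion to how much of it is repeated.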

Second, we develop multi-agent architectures for enterprise automation and propose reliability-centric evaluation for tool-using systems. An invoice reconciliation system achieves 99.9% task success with edge-deployed models at 523 Joules per task, while a structured diagnostic taxonomy identifies tool initialization as a dominant source of agent failures in smaller models. For scalable automated assessment, we introduce an LLM-as-a-Judge framework and the reliability metric Evaluation Completion Rate (ECR@1), enabling principled accuracy-reliability-cost trade-offs across a 175× pricing range; GPT-4o Mini emerges as production-optimal at 78× cost reduction versus premium models. A cooperative multi-agent system for acceptance test generation (CMAS4G2) demonstrates that open-weight models approach proprietary performance (87.6% vs. 90.4% success), with reasoning configuration profoundly impacting coordination. For payment processing, we design the first LLM-based agentic payment framework (HMASP), achieving 97.0% task success, and introduce Agentic Success Rate (ASR)—a trajectory-level workflow fidelity metric grounded in process-mining conformance checking. Across 18 LLMs and 90,000 task instances, ASR exposes systematic procedural deviations invisible to conventional metrics: most models bypass confirmation checkpoints during payment checkout despite near-perfect task success, with regulatory implications for PCI-DSS compliance. Ongoing work is evolving the research prototype into ConvPayMAS—an agentic payment system implementing Google’s AP2 three-mandate cryptographic verification with patent-pending security features—upon which an enterprise agentic evaluation and optimization framework is being designed.
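The gap between task success and trajectory-level fidelity described above can be sketched with a simple ordered-step check. This is a minimal illustration, not the ASR definition or the conformance-checking procedure used in the dissertation: the step names in `REQUIRED_STEPS`, the subsequence test, and both function names are hypothetical.

```python
from __future__ import annotations

from typing import Iterable, Sequence

# Hypothetical checkout workflow; the step names are illustrative only.
REQUIRED_STEPS = ("build_cart", "confirm_order", "submit_payment")


def follows_workflow(trajectory: Sequence[str],
                     required: Sequence[str] = REQUIRED_STEPS) -> bool:
    """True if every required step occurs in the trajectory, in the required order."""
    remaining = iter(trajectory)
    # `step in remaining` consumes the iterator, so order is enforced.
    return all(step in remaining for step in required)


def trajectory_success_rate(runs: Iterable[tuple[Sequence[str], bool]]) -> float:
    """Share of runs that both reach the goal and follow the required workflow."""
    runs = list(runs)
    if not runs:
        return 0.0
    ok = sum(1 for trajectory, reached_goal in runs
             if reached_goal and follows_workflow(trajectory))
    return ok / len(runs)
```

In this sketch, a run that submits payment without the confirmation step still counts as a task success but fails the trajectory check, which is the kind of deviation a goal-only metric would not surface.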

Third, we examine deployment strategies across heterogeneous hardware and tasks. Benchmarking across six hardware platforms shows that platform choices (e.g., WSL vs. native Windows) can yield up to 21× inference speedups, while medium-sized models (7B–32B) achieve optimal performance-efficiency trade-offs. A central finding from 504 configurations across seven model families is that reasoning effectiveness is strongly task-dependent: reasoning degrades simple binary sentiment classification (up to −19.9 F1 pp, 100% failure rate) yet substantially improves complex emotion recognition (up to +16.0 F1 pp, only 14% failure rate). Entity matching evaluation across 46 model configurations and 8 benchmark datasets reveals that reasoning effects are also family-dependent: low reasoning is optimal for GPT-5, high reasoning is critical for open-weight models, and Claude models exhibit extreme asymmetric sensitivity. The same evaluation shows open-weight models rivaling proprietary APIs, with GPT-OSS:120b outperforming GPT-4.1 at 20× lower cost. Cross-domain studies in machine translation, logical reasoning, and sports analytics further validate the generalizability of the proposed metrics and deployment principles.

Overall, this dissertation demonstrates that open-source LLMs, when optimized and evaluated with reliability-aware metrics, can support privacy-preserving and cost-effective enterprise deployment. The proposed metrics, architectural patterns, and empirically grounded guidelines provide a practical foundation for building robust generative AI systems in production settings.

Keywords

Large Language Models, Retrieval-Augmented Generation, Multi-Agent Systems, Agentic Commerce, Reasoning, Edge Deployment, Enterprise AI, Model Quantization, Few-Shot Learning, Fine-Tuning, Sentiment Analysis, Entity Matching, Tool Invocation, Evaluation Metrics, LoRA, Privacy-Preserving AI, Workflow Automation

Degree Awarded

Doctor of Engineering

Discipline

Artificial Intelligence and Robotics | Computer Sciences

Supervisor(s)

NGO, Chong Wah; WANG, Zhaoxia

First Page

1

Last Page

244

Publisher

Singapore Management University

City or Country

Singapore

Copyright Owner and License

Author
