Publication Type

PhD Dissertation

Version

publishedVersion

Publication Date

5-2025

Abstract

Deep Reinforcement Learning (RL) has achieved remarkable success over the past decade, from superhuman performance in video games to real-world applications such as robotics. However, RL policies often generalize poorly, making them unreliable when deployed in unfamiliar scenarios. For example, robots must adapt to varying terrains with different slopes and obstacles, yet standard RL training does not explicitly promote such adaptability. While various methods have been proposed to enhance RL robustness, achieving reliable generalization remains an open challenge.

This dissertation focuses on improving the generalization capability of agents in three major settings: infinite horizon RL agents, finite horizon RL agents, and large language model-based agents (LLM agents). In this dissertation, the terms infinite horizon and finite horizon refer to the number of training environments the agent is exposed to, rather than the temporal horizon (i.e., episode length) of the Markov Decision Process (MDP) within each environment. First, infinite horizon training pursues open-endedness, relying on the continual learning of RL agents across a vast number of environments. This setting emphasizes progressively improving the agent's generalizability without prioritizing training efficiency. Second, finite horizon training targets high training efficiency, equipping agents with generalization capabilities through a limited number of training scenarios, which is crucial in applications where generating new scenarios is expensive or efficiency is critical. Third, beyond RL agents, this dissertation explores how to train LLM agents to generalize better in planning tasks, extending the scope to broader and more general settings. The core methodological thread is the introduction of diversity into training: the dissertation proposes multiple diversity metrics and integrates them with RL and LLM training pipelines.

Diversity-Augmented Infinite Horizon Training.
In infinite horizon training, the RL agent learns through a very large number of training environments (hundreds of thousands) to continually improve its generalization capability in unfamiliar scenarios. Recently, Unsupervised Environment Design (UED) has emerged as a promising framework for training generalizable RL agents in this setting. By modeling training as an interactive process between a teacher agent and a student agent, UED enables the teacher to dynamically generate new training environments to improve the student’s adaptability in unseen scenarios. Existing UED algorithms predominantly focus on regret-based metrics, which generate challenging environments at the frontier of the student's capabilities. However, these approaches often overlook the importance of environmental diversity, which is crucial for robust learning.
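
As a concrete illustration of this teacher-student interaction, the sketch below implements a generic regret-driven curriculum loop in the spirit of methods such as Prioritized Level Replay. The DummyStudent class, the estimated_regret surrogate, and all interfaces are hypothetical stand-ins for illustration, not the algorithms developed in this dissertation.

```python
import random

class DummyStudent:
    """Placeholder student agent; a real one would be an RL learner."""
    def rollout(self, env_params):
        return random.gauss(sum(env_params), 1.0)  # fake episode return
    def train_on(self, env_params):
        pass  # the student's RL update would go here

def estimated_regret(student, env_params, n_episodes=4):
    """Proxy regret: gap between the best and the mean return over rollouts.
    The optimal return is unknown, so the max over rollouts stands in for it
    (in the spirit of regret surrogates used by UED methods)."""
    returns = [student.rollout(env_params) for _ in range(n_episodes)]
    return max(returns) - sum(returns) / len(returns)

def ued_loop(student, sample_env_params, n_iters=100, buffer_size=16):
    buffer = []  # (regret, env_params); highest-regret levels kept for replay
    for _ in range(n_iters):
        params = sample_env_params()               # teacher proposes a level
        buffer.append((estimated_regret(student, params), params))
        buffer.sort(key=lambda x: -x[0])           # prioritize high regret
        del buffer[buffer_size:]
        _, replay = random.choice(buffer)          # replay a challenging level
        student.train_on(replay)
    return student

ued_loop(DummyStudent(), lambda: [random.uniform(0, 1) for _ in range(3)])
```
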

To address this, this dissertation introduces Diversity-Augmented Infinite Horizon Training, incorporating two novel diversity metrics. The first computes the Wasserstein distance between occupancy measures of the student's behavior in different environments, providing a pairwise measure of diversity. The second leverages a Gaussian Mixture Model (GMM) to quantify the novelty of a given environment relative to previously encountered ones, offering a scalable and computationally efficient measure of diversity. By dynamically exposing the student agent to diverse training environments, the resulting algorithms achieve state-of-the-art generalization performance across multiple benchmarks.
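
The following minimal sketch illustrates both metrics with off-the-shelf tools. The representations are simplifying assumptions made for the sketch: occupancy measures are reduced to samples of a single scalar state feature, and each environment is summarized by a fixed-length feature vector; the dissertation's actual formulation operates on the student's behavior in full environments.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.mixture import GaussianMixture

def pairwise_diversity(occupancy_a, occupancy_b):
    """Wasserstein distance between two empirical occupancy measures,
    simplified here to 1-D state-feature samples per environment."""
    return wasserstein_distance(occupancy_a, occupancy_b)

def gmm_novelty(candidate_features, past_env_features, n_components=8):
    """Novelty of a candidate environment: negative log-likelihood under a
    GMM fit to features of previously encountered environments. Cheap to
    evaluate once fit, hence scalable to very many environments."""
    gmm = GaussianMixture(n_components=n_components, random_state=0)
    gmm.fit(past_env_features)
    return -gmm.score_samples(candidate_features[None, :])[0]

rng = np.random.default_rng(0)
occ_a = rng.normal(0.0, 1.0, size=300)     # state-feature samples, env A
occ_b = rng.normal(0.5, 1.0, size=300)     # state-feature samples, env B
print(pairwise_diversity(occ_a, occ_b))    # larger => more diverse pair

past = rng.normal(size=(500, 16))          # features of seen environments
candidate = rng.normal(size=16) + 3.0      # an out-of-distribution candidate
print(gmm_novelty(candidate, past))        # large value => novel
```
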

Diversity-Augmented Finite Horizon Training.
Although infinite horizon training can equip RL agents with superior generalization, it suffers from low training efficiency. For example, training a robot to climb stairs might require thousands of simulation environments with varying heights and slopes. Such infinite training sequences are impractical in real-world applications where generating new scenarios is expensive or efficiency is crucial. To address this, we introduce Unsupervised Training Sequence Design (UTSD), a novel Markov Decision Process (MDP) formulation for the teacher agent. Unlike traditional UED, the teacher in UTSD curates a finite sequence of training environments, encoding key information about the student's learning progress into the teacher's state space. To achieve this, we employ a Quality Diversity (QD) approach to select validation environments that are diverse with respect to the student policy. Building on this framework, we propose Meta-Teacher, a meta-learning algorithm that enables the teacher to adapt efficiently to unseen students by leveraging past experience. Empirical evaluations demonstrate the effectiveness of Meta-Teacher by highlighting the teacher's generalization capability, specifically its ability to design efficient and effective training sequences for students with varying levels of capability.
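
A toy MAP-Elites-style archive conveys the flavor of Quality Diversity selection: discretize a behavior-descriptor space into cells and keep the best environment per cell. The descriptor_fn and score_fn below are hypothetical placeholders; in UTSD, both are defined with respect to the student policy.

```python
import numpy as np

def qd_select(candidates, descriptor_fn, score_fn, bins=10):
    """MAP-Elites-style selection: keep the highest-scoring environment per
    descriptor cell, yielding a set that is both useful and behaviorally
    diverse rather than merely high-scoring."""
    archive = {}  # cell -> (score, environment)
    for env in candidates:
        d = np.clip(np.asarray(descriptor_fn(env)), 0.0, 1.0 - 1e-9)
        cell = tuple((d * bins).astype(int))   # discretize descriptor
        s = score_fn(env)
        if cell not in archive or s > archive[cell][0]:
            archive[cell] = (s, env)
    return [env for _, env in archive.values()]

rng = np.random.default_rng(0)
envs = rng.random((200, 2))                    # toy environment parameters
validation = qd_select(envs,
                       descriptor_fn=lambda e: e,   # descriptor = parameters
                       score_fn=lambda e: -abs(e[0] - 0.5))
print(len(validation), "diverse validation environments")
```
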

Diversity-Augmented Large Language Model Training.
Beyond the RL setting, this dissertation extends diversity-augmented training to large language model-based agents. While LLM agents exhibit strong capabilities on general tasks like writing and searching, their performance on planning tasks remains suboptimal. Extensive fine-tuning on task-specific data can yield strong results but incurs high computational and economic costs. To improve sample efficiency and generalization, we propose Clustering-Based Maximum Diversity Sampling (CMDS), which selects diverse and representative training examples based on the structural and graph representations of the data rather than on conventional language embeddings. Empirical results show that our method consistently outperforms baseline methods across multiple benchmark domains.
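
The sketch below shows one plausible instantiation of clustering-based diverse sampling using k-means: cluster the example embeddings and keep the example nearest each centroid. The random vectors here are placeholders for the graph-derived structural embeddings that CMDS actually builds from the planning data.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_diverse_examples(embeddings, k):
    """Cluster embeddings with k-means and keep, per cluster, the example
    nearest the centroid: a small, diverse, representative training set."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    chosen = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c],
                               axis=1)
        chosen.append(int(members[np.argmin(dists)]))
    return chosen  # indices into the training pool

rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 32))   # placeholder graph-derived embeddings
print(select_diverse_examples(pool, k=8))
```
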

This work contributes novel diversity metrics, frameworks, and algorithms for training generalizable RL and LLM agents, with broad implications across different settings and domains.

Keywords

Deep Reinforcement Learning, Large Language Models, AI, Robustness

Degree Awarded

PhD in Computer Science

Discipline

Artificial Intelligence and Robotics

Supervisor(s)

VARAKANTHAM, Pradeep Reddy

First Page

1

Last Page

120

Publisher

Singapore Management University

City or Country

Singapore

Copyright Owner and License

Author
