Publication Type
PhD Dissertation
Version
publishedVersion
Publication Date
7-2025
Abstract
Real-world decision-making often involves safety constraints that are implicit, non-Markovian, or difficult to specify directly. Standard reinforcement learning (RL) approaches typically assume access to fully specified cost functions and constraint budgets—assumptions that limit their applicability in domains where such structure must instead be inferred from data. This dissertation develops a sequence of methods for learning safety-relevant structure from weak supervision, such as sparse binary feedback on trajectory segments, and using these signals to guide planning and policy optimization.
The first part of the dissertation introduces Iterative Lower-Bound Optimization (ILBO), a method for planning in continuous Markov Decision Processes (MDPs) using deep reactive policies. ILBO provides a stable and sample-efficient framework for optimizing deterministic policies through iterative local improvement, and serves as a foundation for later developments in cost-aware learning. While ILBO assumes access to known reward and transition functions, the remainder of the dissertation relaxes this assumption by learning safety structure directly from data.
Subsequent chapters explore how to recover latent safety constraints from limited feedback. One contribution develops a method to infer non-Markovian constraints from sparse trajectory-level labels in known-dynamics settings. This is extended to unknown environments with the TraCeS algorithm, which actively selects informative trajectories for annotation and learns safety credit to guide safe reinforcement learning. Finally, the RLUC framework addresses the problem of learning unknown non-Markovian constraints in off-policy RL, integrating constraint inference with probabilistic modeling and policy optimization.
Together, these contributions offer a unified framework for constraint learning and safe decision-making from weak supervision, bridging planning and RL under uncertain safety structure.
Degree Awarded
PhD in Computer Science
Discipline
Artificial Intelligence and Robotics | Computer and Systems Architecture
Supervisor(s)
KUMAR, Akshat
First Page
1
Last Page
165
Publisher
Singapore Management University
City or Country
Singapore
Citation
LOW, Siow Meng. From sparse feedback to sequential decision-making: Learning safety constraints with weak supervision. (2025). 1-165.
Available at: https://ink.library.smu.edu.sg/etd_coll/796
Copyright Owner and License
Author
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.