Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
3-2025
Abstract
Offline safe reinforcement learning (RL) has emerged as a promising approach for learning safe behaviors without engaging in risky online interactions with the environment. Most existing methods in offline safe RL rely on per-time-step cost constraints (derived from global cost constraints), which can result in either overly conservative policies or violations of safety constraints. In this paper, we propose to learn a policy that generates desirable trajectories and avoids undesirable trajectories. Specifically, we first partition the pre-collected dataset of state-action trajectories into desirable and undesirable subsets. Intuitively, the desirable set contains safe, high-reward trajectories, and the undesirable set contains unsafe trajectories and safe but low-reward trajectories. Second, we learn a policy that generates desirable trajectories and avoids undesirable trajectories, where (un)desirability scores are provided by a classifier learned from the dataset of desirable and undesirable trajectories. This approach bypasses the computational complexity and stability issues of the min-max objectives employed in existing methods. Theoretically, we also show our approach's strong connections to existing learning paradigms involving human feedback. Finally, we extensively evaluate our method using the DSRL benchmark for offline safe RL. Empirically, our method outperforms competitive baselines, achieving higher rewards and better constraint satisfaction across a wide variety of benchmark tasks.
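
The following is a minimal sketch (not the authors' code) of the two-stage idea described in the abstract: (1) partition offline trajectories into desirable and undesirable subsets, and (2) train a classifier on that partition and use its scores to guide policy learning. All names here (cost_limit, reward_threshold, TrajectoryClassifier, the weighted behavior-cloning objective) are illustrative assumptions; the paper's actual architecture and policy objective may differ.

import torch
import torch.nn as nn

def partition(trajectories, cost_limit, reward_threshold):
    """Label each trajectory: desirable = safe AND high-reward;
    undesirable = unsafe, or safe but low-reward."""
    desirable, undesirable = [], []
    for traj in trajectories:
        safe = traj["cost"] <= cost_limit
        if safe and traj["reward"] >= reward_threshold:
            desirable.append(traj)
        else:
            undesirable.append(traj)
    return desirable, undesirable

class TrajectoryClassifier(nn.Module):
    """Scores a trajectory feature vector; sigmoid output ~ P(desirable)."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats):
        return torch.sigmoid(self.net(feats)).squeeze(-1)

def classifier_loss(clf, desirable_feats, undesirable_feats):
    # Plain binary cross-entropy: push desirable scores toward 1 and
    # undesirable scores toward 0 -- a standard supervised objective,
    # with no min-max over a Lagrange multiplier.
    pos = clf(desirable_feats)
    neg = clf(undesirable_feats)
    return -(torch.log(pos + 1e-8).mean() + torch.log(1.0 - neg + 1e-8).mean())

def policy_loss(log_probs, clf_scores):
    # One possible use of the scores (an assumption, not the paper's stated
    # objective): weighted behavior cloning, imitating trajectories in
    # proportion to how desirable the frozen classifier judges them to be.
    return -(clf_scores.detach() * log_probs).mean()

Under this reading, safety enters through the trajectory labels rather than through per-time-step cost penalties, which is what lets the policy stage avoid a min-max formulation.
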
Discipline
Artificial Intelligence and Robotics
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
Proceedings of the 39th Annual AAAI Conference on Artificial Intelligence, Philadelphia, Pennsylvania, 2025 February 25 - March 4
First Page
16880
Last Page
16887
Identifier
10.1609/aaai.v39i16.33855
City or Country
Philadelphia, Pennsylvania
Citation
GONG, Ze; KUMAR, Akshat; and VARAKANTHAM, Pradeep.
Offline safe reinforcement learning using trajectory classification. (2025). Proceedings of the 39th Annual AAAI Conference on Artificial Intelligence, Philadelphia, Pennsylvania, 2025 February 25 - March 4. 16880-16887.
Available at: https://ink.library.smu.edu.sg/sis_research/10667
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1609/aaai.v39i16.33855