Poster
POCE: Primal Policy Optimization with Conservative Estimation for Multi-constraint Offline Reinforcement Learning
Jiayi Guan · Li Shen · Ao Zhou · Lusong Li · Han Hu · Xiaodong He · Guang Chen · Changjun Jiang
Arch 4A-E Poster #204
Multi-constraint offline reinforcement learning (RL) promises to learn policies that satisfy both cumulative and state-wise costs from offline datasets. This setting enables the practical application of RL in high-risk scenarios where both types of cost must be considered simultaneously. However, prior constrained offline RL algorithms are primarily designed for single-constraint problems involving only the cumulative cost, and they struggle with multi-constraint tasks that involve both cumulative and state-wise costs. In this work, we propose a novel Primal policy Optimization with Conservative Estimation (POCE) algorithm to address the problem of multi-constraint offline RL. Concretely, we reframe the objective of multi-constraint offline RL by introducing the concept of Maximum Markov Decision Processes (MMDP). We then present a primal policy optimization algorithm for multi-constraint problems, which improves the stability and convergence speed of model training. Furthermore, we propose a conditional Bellman operator to estimate cumulative and state-wise Q-values, reducing the extrapolation error caused by out-of-distribution (OOD) actions. Finally, extensive experiments demonstrate that POCE achieves competitive performance across multiple benchmark tasks and, in particular, outperforms baseline algorithms in terms of safety.
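To make the multi-constraint setting concrete, below is a minimal sketch of one common way to write such an objective; the symbols (reward r, cumulative cost c with budget \kappa_c, state-wise cost h with threshold \kappa_h) are illustrative and not taken verbatim from the paper. The max over the trajectory is the quantity a maximum-style MDP (MMDP) value function captures, in contrast to the discounted sum used for the cumulative constraint.

\begin{aligned}
\max_{\pi}\;\; & \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right] \\
\text{s.t.}\;\; & \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \le \kappa_c && \text{(cumulative cost)} \\
& \mathbb{E}_{\tau \sim \pi}\!\left[\max_{t \ge 0}\, h(s_t)\right] \le \kappa_h && \text{(state-wise cost)}
\end{aligned}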