e-COP : Episodic Constrained Optimization of Policies
CoRR (2024)
Abstract
In this paper, we present e-COP, the first policy
optimization algorithm for constrained Reinforcement Learning (RL) in episodic
(finite horizon) settings. Such formulations are applicable when there are
separate sets of optimization criteria and constraints on a system's behavior.
We approach this problem by first establishing a policy difference lemma for
the episodic setting, which provides the theoretical foundation for the
algorithm. Then, we propose to combine a set of established and novel solution
ideas to yield the e-COP algorithm, which is easy to implement and
numerically stable, and we provide a theoretical guarantee on optimality under
certain scaling assumptions. Through extensive empirical analysis using
benchmarks in the Safety Gym suite, we show that our algorithm has similar or
better performance than SoTA (non-episodic) algorithms adapted for the episodic
setting. The scalability of the algorithm opens the door to its application in
safety-constrained Reinforcement Learning from Human Feedback for Large
Language or Diffusion Models.
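The episodic (finite-horizon) constrained RL problem described above is commonly written as a reward-maximization objective subject to a cumulative cost budget. The following is a standard formulation sketch, not notation taken from the paper itself: $H$ denotes the horizon, $r_h$ and $c_h$ the per-step reward and cost, and $d$ the cost budget.

```latex
\max_{\pi} \; J_r(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{h=1}^{H} r_h(s_h, a_h)\right]
\quad \text{s.t.} \quad
J_c(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{h=1}^{H} c_h(s_h, a_h)\right] \;\le\; d
```

In this setting the policy (and value functions) may be non-stationary, i.e., depend on the step index $h$, which is what distinguishes the episodic formulation from the infinite-horizon discounted setting targeted by most prior constrained policy optimization algorithms.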