QUICK REVIEW

[Paper Review] Constrained Policy Optimization

Joshua Achiam, David Held|arXiv (Cornell University)|May 30, 2017

Reinforcement Learning in Robotics | References: 20 | Cited by: 111

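One-Sentence Summary

Constrained Policy Optimization (CPO) is a policy search method for constrained reinforcement learning that uses surrogate objectives and trust-region updates to improve return while keeping the policy approximately within its constraints at every training iteration.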

ABSTRACT

For many applications of reinforcement learning it can be more convenient to specify both a reward function and constraints, rather than trying to design behavior through the reward function. For example, systems that physically interact with or around humans should satisfy safety constraints. Recent advances in policy search algorithms (Mnih et al., 2016, Schulman et al., 2015, Lillicrap et al., 2016, Levine et al., 2016) have enabled new capabilities in high-dimensional control, but do not consider the constrained setting. We propose Constrained Policy Optimization (CPO), the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration. Our method allows us to train neural network policies for high-dimensional control while making guarantees about policy behavior all throughout training. Our guarantees are based on a new theoretical result, which is of independent interest: we prove a bound relating the expected returns of two policies to an average divergence between them. We demonstrate the effectiveness of our approach on simulated robot locomotion tasks where the agent must satisfy constraints motivated by safety.

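Motivation and Goals

Encourage safety and constraint satisfaction in reinforcement learning, going beyond unconstrained reward optimization.
Develop a general-purpose policy search algorithm for constrained Markov decision processes (CMDPs) that comes with constraint-satisfaction guarantees.
Provide a theoretical foundation that relates the difference in performance between two policies to an average divergence between them.
Enable training of neural network policies for high-dimensional control while enforcing safety-related constraints.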

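Proposed Method

Introduces CPO, a trust-region policy optimization method for CMDPs with guarantees of monotonic improvement and near-constraint satisfaction at each iteration.
Derives a new performance bound that relates the difference in return between two policies to the average divergence between them.
Uses surrogate objectives and constraints that can be estimated from samples, which makes the update practical.
Proposes a practical dual optimization method based on conjugate gradients to solve the update efficiently in high dimensions.
Shapes the cost so that an upper bound on the cost is enforced rather than the cost itself, which improves constraint satisfaction.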
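
To make the conjugate-gradient step more concrete, here is a minimal sketch of the generic technique, not the authors' implementation. It approximately solves H x = g using only matrix-vector products, which is what allows trust-region methods to avoid forming or inverting a large matrix H. The names `conjugate_gradient`, `hvp`, and `toy_hvp`, and the 2x2 toy matrix, are illustrative assumptions and do not come from the paper.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(hvp: Callable[[Vector], Vector], g: Vector,
                       iters: int = 10, tol: float = 1e-10) -> Vector:
    """Approximately solve H x = g for symmetric positive-definite H.

    `hvp(v)` must return the product H v; H itself is never materialized.
    """
    x = [0.0] * len(g)
    r = list(g)          # residual g - H x, with x = 0 initially
    p = list(g)          # current search direction
    rs_old = dot(r, r)
    for _ in range(iters):
        if rs_old < tol:  # residual small enough: stop early
            break
        hp = hvp(p)
        alpha = rs_old / dot(p, hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def toy_hvp(v: Vector) -> Vector:
    """Hypothetical stand-in for a Hessian-vector product: H = [[4, 1], [1, 3]]."""
    return [4.0 * v[0] + 1.0 * v[1], 1.0 * v[0] + 3.0 * v[1]]


if __name__ == "__main__":
    # Exact solution of H x = [1, 2] is [1/11, 7/11].
    print(conjugate_gradient(toy_hvp, [1.0, 2.0]))
```

In a real policy-gradient setting, `hvp` would typically be computed by automatic differentiation of the average KL divergence, and the same routine would be called once for the objective gradient and once for each constraint gradient.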

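Experimental Results

Research Questions

RQ1: Can a policy search algorithm enforce CMDP constraints throughout learning while still improving return monotonically?
RQ2: How can the change in performance from one policy to another be bounded in terms of the average divergence between them?
RQ3: Do trust-region updates make constrained policy optimization practical and scalable for neural network policies?
RQ4: How does cost shaping (constraining an upper bound on the cost) affect adherence to safety constraints in practice?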
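
For RQ2, the kind of bound the abstract describes can be written out as follows. This is reconstructed from memory of the paper rather than from the text above, so the notation and constants should be treated as assumptions: J is the expected discounted return, gamma the discount factor, d^pi the discounted state distribution of the current policy pi, A^pi its advantage function, and D_TV the total variation distance between the action distributions at a state.

```latex
% Assumed notation: \epsilon^{\pi'} = \max_s \left| \mathbb{E}_{a \sim \pi'}[A^{\pi}(s,a)] \right|
J(\pi') - J(\pi) \;\ge\; \frac{1}{1-\gamma}\,
  \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi'}
  \left[ A^{\pi}(s,a) - \frac{2\gamma\,\epsilon^{\pi'}}{1-\gamma}\,
         D_{TV}(\pi' \,\|\, \pi)[s] \right]

% Applying Pinsker's and Jensen's inequalities, if
% \mathbb{E}_{s \sim d^{\pi}}\left[ D_{KL}(\pi' \,\|\, \pi)[s] \right] \le \delta, then
J(\pi') - J(\pi) \;\ge\; \frac{1}{1-\gamma}\,
  \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi'}\left[ A^{\pi}(s,a) \right]
  \;-\; \frac{\sqrt{2\delta}\,\gamma\,\epsilon^{\pi'}}{(1-\gamma)^{2}}
```

The first term is a surrogate that can be estimated from samples of the current policy. The second term is the worst-case penalty, which shrinks as the trust-region radius delta shrinks.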

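Key Findings

CPO stays close to constraint satisfaction throughout training of high-dimensional neural network policies on simulated robot locomotion tasks.
Compared with primal-dual optimization (PDO), CPO enforces constraints more reliably during training and often achieves better return.
Shaping the constraint through an upper bound on an auxiliary cost improves adherence to the true safety constraint without sacrificing performance.
Fixed-penalty methods are sensitive to the penalty value, whereas CPO balances the trade-off between reward and constraints automatically.
Empirically, policies trained with unconstrained TRPO violate the constraints, which shows that constrained optimization is necessary.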
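
The sensitivity of fixed-penalty methods can be seen in a toy problem. The sketch below is not CPO and not the paper's experiments. It is a hypothetical one-parameter example in which reward is theta, cost is theta squared, and the cost limit is 1, so the best feasible choice is theta = 1. A fixed penalty either violates the limit or is overly conservative, while a simple dual-ascent update of the penalty (the idea behind primal-dual methods) settles on the right trade-off. All names and numbers here are illustrative assumptions.

```python
COST_LIMIT = 1.0  # hypothetical constraint threshold


def reward(theta: float) -> float:
    return theta


def cost(theta: float) -> float:
    return theta ** 2


def best_response(penalty: float) -> float:
    """Closed-form maximizer of reward(theta) - penalty * cost(theta)."""
    return 1.0 / (2.0 * penalty)


def fixed_penalty(penalty: float):
    """Return (reward, cost) of the policy obtained with a fixed penalty."""
    theta = best_response(penalty)
    return reward(theta), cost(theta)


def adaptive_penalty(penalty: float = 1.0, lr: float = 0.1, steps: int = 200):
    """Dual ascent: raise the penalty when over the limit, lower it otherwise."""
    for _ in range(steps):
        theta = best_response(penalty)
        penalty = max(1e-3, penalty + lr * (cost(theta) - COST_LIMIT))
    return penalty, best_response(penalty)


if __name__ == "__main__":
    print(fixed_penalty(0.25))   # penalty too small: cost exceeds the limit
    print(fixed_penalty(2.0))    # penalty too large: feasible but low reward
    print(adaptive_penalty())    # penalty adapts toward the constrained optimum
```

Even in this toy setting the adaptive penalty only satisfies the constraint in the limit, and intermediate iterates can sit on either side of it. That matches the finding above that PDO enforces constraints less reliably during training than CPO.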

This review was generated by AI and reviewed by human editors.