QUICK REVIEW

[Paper Review] Reinforcement Learning with Convex Constraints

Sobhan Miryoosefi, Kianté Brantley | arXiv (Cornell University) | Jun 21, 2019

Reinforcement Learning in Robotics | References: 15 | Cited by: 32

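One-Sentence Summary

TL;DR: We propose the APPROPO framework, which makes reinforcement learning workable under arbitrary convex constraints by casting constraint satisfaction as a Blackwell-style approachability game and solving it with online convex optimization. The method modularly combines any RL algorithm that optimizes a scalar reward with a no-regret learner over the constraint measurements. It finds a feasible policy when one exists and otherwise approximately minimizes the distance to the constraint set.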

ABSTRACT

In standard reinforcement learning (RL), a learning agent seeks to optimize the overall reward. However, many key aspects of a desired behavior are more naturally expressed as constraints. For instance, the designer may want to limit the use of unsafe actions, increase the diversity of trajectories to enable exploration, or approximate expert trajectories when rewards are sparse. In this paper, we propose an algorithmic scheme that can handle a wide class of constraints in RL tasks: specifically, any constraints that require expected values of some vector measurements (such as the use of an action) to lie in a convex set. This captures previously studied constraints (such as safety and proximity to an expert), but also enables new classes of constraints (such as diversity). Our approach comes with rigorous theoretical guarantees and only relies on the ability to approximately solve standard RL tasks. As a result, it can be easily adapted to work with any model-free or model-based RL. In our experiments, we show that it matches previous algorithms that enforce safety via constraints, but can also enforce new properties that these algorithms do not incorporate, such as diversity.

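Motivation and Objectives

Motivate learning objectives that are more naturally expressed through vector-valued measurements than through a single scalar reward (e.g., safety, exploration diversity).
Establish a general algorithmic framework that handles arbitrary convex constraints on long-term measurements in RL tasks.
Provide theoretical guarantees (sublinear regret and convergence of the distance to the constraint set) together with practical implementation guidance.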

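Proposed Method

Formulate the problem as finding a mixed policy µ whose long-term measurement vector z(µ) lies in a convex constraint set C (a feasibility problem).
Express dist(z(µ), C) as the maximum of λ · z(µ) over λ in the polar cone of C intersected with the unit ball, which yields a zero-sum game between a policy player and a constraint player.
Use a no-regret online learner (OGD) for the constraint player and, for the policy player, a best-response oracle that solves a standard RL problem with scalar reward r = −λ · z.
Implement APPROPO by iterating four steps: choose λ, obtain πt from BESTRESPONSE, estimate z(πt) with EST, and update λ by online gradient descent with projection onto the polar cone of C.
Handle arbitrary convex constraints, not only orthant constraints, by using the polar cone Λ = C◦ ∩ B together with projection-based updates.
Extend the method to non-conic convex sets by lifting them to a cone (via a conic-hull construction), with corresponding guarantees.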
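The loop described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the three "policies" and their measurement vectors in `Z` are invented, `best_response` is an exact stand-in for BESTRESPONSE, EST is assumed to be exact, and C is taken to be the nonpositive orthant, whose polar cone is the nonnegative orthant.

```python
import numpy as np

# Hypothetical toy problem: three deterministic policies with known
# long-term measurement vectors z(pi). None of them lies in C on its own.
Z = np.array([[1.0, -3.0], [-3.0, 1.0], [2.0, 2.0]])

def best_response(lam):
    """Stand-in for BESTRESPONSE: pick the policy maximizing r = -lam . z."""
    return int(np.argmin(Z @ lam))

def project_lambda(lam):
    """Project onto Lambda = polar(C) ∩ unit ball, for C = nonpositive orthant."""
    lam = np.maximum(lam, 0.0)          # polar cone of C: nonnegative orthant
    norm = np.linalg.norm(lam)
    return lam / norm if norm > 1.0 else lam

def dist_to_C(x):
    """Euclidean distance from x to the nonpositive orthant."""
    return float(np.linalg.norm(np.maximum(x, 0.0)))

def appropo_toy(T=2000):
    eta = 1.0 / np.sqrt(T)              # assumed step size for OGD
    lam = np.zeros(2)
    counts = np.zeros(len(Z))
    for _ in range(T):
        k = best_response(lam)          # policy player responds to lambda
        z_hat = Z[k]                    # EST assumed exact in this sketch
        counts[k] += 1
        lam = project_lambda(lam + eta * z_hat)  # constraint player: OGD ascent
    weights = counts / T                # mixed policy: uniform over iterates
    return weights, weights @ Z

weights, z_bar = appropo_toy()
print("mixture weights:", weights)
print("average measurement:", z_bar, "distance to C:", dist_to_C(z_bar))
```

The returned mixture averages the iterates, so its measurement vector is the average of the per-round measurements; in this toy problem feasibility is only reachable by mixing the first two policies.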

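Experimental Results

Research Questions

RQ1: Can RL with arbitrary convex constraints be solved through a game-theoretic reduction to approachability?
RQ2: How can a no-regret learner be combined with a standard RL solver to enforce vector-valued constraints?
RQ3: What theoretical guarantees (regret bounds, convergence to the constraint set) can be established for APPROPO?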

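Main Findings

APPROPO produces a mixed policy whose long-term measurement vector approaches the target convex constraint set, up to an error term governed by the learner's sublinear regret.
For feasible problems, APPROPO guarantees that dist(z(µ̄), C) converges toward zero, with the bound determined jointly by the online learner's regret and the estimation error.
Experiments in a Mars rover grid world show that APPROPO matches RCPO on orthant constraints and can also enforce diversity constraints that RCPO cannot handle.
The framework is compatible with common RL methods (e.g., actor-critic) and can use a positive-response oracle to solve the feasibility problem.
When the method is extended to general convex sets, the conic lifting retains guarantees, and approximately minimizing the distance to C remains achievable.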
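The dependence of the distance bound on regret and estimation error can be traced in a few lines. The following is a sketch of the standard no-regret argument for a feasible problem with a conic C, not a quotation of the paper's theorem. The symbols are assumptions introduced here: \hat z_t is the EST output with error at most \epsilon_1, BESTRESPONSE is \epsilon_0-optimal, and \mathrm{Regret}_T is the regret of the constraint player after T rounds.

```latex
% Assumptions: C is a convex cone, \Lambda = C^\circ \cap B, \|\lambda\| \le 1,
% \|\hat z_t - z(\pi_t)\| \le \epsilon_1, and \bar\mu is uniform over \pi_1..\pi_T.
\begin{align*}
\operatorname{dist}(z(\bar\mu), C)
  &= \max_{\lambda \in \Lambda} \frac{1}{T}\sum_{t=1}^{T} \lambda \cdot z(\pi_t) \\
  &\le \max_{\lambda \in \Lambda} \frac{1}{T}\sum_{t=1}^{T} \lambda \cdot \hat z_t + \epsilon_1 \\
  &\le \frac{1}{T}\sum_{t=1}^{T} \lambda_t \cdot \hat z_t + \frac{\mathrm{Regret}_T}{T} + \epsilon_1 \\
  &\le \frac{1}{T}\sum_{t=1}^{T} \lambda_t \cdot z(\pi_t) + \frac{\mathrm{Regret}_T}{T} + 2\epsilon_1 \\
  &\le \frac{\mathrm{Regret}_T}{T} + \epsilon_0 + 2\epsilon_1 .
\end{align*}
```

The last step uses feasibility: some mixed policy has its measurement vector in C, and every \lambda_t lies in the polar cone, so the best achievable value of \lambda_t \cdot z(\pi) is at most zero and the oracle's response is at most \epsilon_0. With a sublinear-regret learner such as OGD, the first term vanishes as T grows.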

This review was generated by AI and reviewed by a human editor.