QUICK REVIEW

[Paper Review] Safe Reinforcement Learning via Curriculum Induction

Matteo Turchetta, Andrey Kolobov | arXiv (Cornell University) | Jun 22, 2020

Reinforcement Learning in Robotics | References: 44 | Cited by: 41

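One-Sentence Summary

CISR introduces a curriculum-based framework for safe reinforcement learning in which a teacher agent keeps the learner safe during training by intervening, while evolving the curriculum across generations of students to improve final policy performance.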

ABSTRACT

In safety-critical applications, autonomous agents may need to learn in an environment where mistakes can be very costly. In such settings, the agent needs to behave safely not only after but also while learning. To achieve this, existing safe reinforcement learning methods make an agent rely on priors that let it avoid dangerous situations during exploration with high probability, but both the probabilistic guarantees and the smoothness assumptions inherent in the priors are not viable in many scenarios of interest such as autonomous driving. This paper presents an alternative approach inspired by human teaching, where an agent learns under the supervision of an automatic instructor that saves the agent from violating constraints during learning. In this model, we introduce the monitor that neither needs to know how to do well at the task the agent is learning nor needs to know how the environment works. Instead, it has a library of reset controllers that it activates when the agent starts behaving dangerously, preventing it from doing damage. Crucially, the choices of which reset controller to apply in which situation affect the speed of agent learning. Based on observing agents' progress, the teacher itself learns a policy for choosing the reset controllers, a curriculum, to optimize the agent's final policy reward. Our experiments use this framework in two environments to induce curricula for safe and efficient learning.

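Motivation and Objectives

Motivate safe reinforcement learning in safety-critical environments where exploration is costly or dangerous.
Propose CISR, a teacher-student framework that uses interventions to keep the learning process safe without requiring a model of the environment.
Develop a curriculum policy that optimizes the sequence of interventions based on observed student progress.
Provide theoretical guarantees on the safety properties of intervention-induced CMDPs.
Demonstrate empirical safety and efficiency benefits in challenging environments, and show that curriculum policies transfer across agents.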
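
For readers unfamiliar with the term, a CMDP (constrained Markov decision process) is commonly written as the constrained optimization below. The symbols (reward r, cost c, budget \kappa, discount \gamma) are standard textbook notation chosen here for illustration and may differ from the notation used in the paper.

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t) \right] \le \kappa
```

A policy is called feasible when it satisfies the cost constraint. This is the sense in which the summary speaks of policies being feasible in the original CMDP.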

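Proposed Method

Define an intervention as a trigger set paired with a state-conditioned reset distribution.
Model each intervention as a modified CMDP that preserves feasibility and can override the dynamics to keep the learner safe.
Pose the student's learning problem under an intervention, subject to constraints on safety violations and to constraints imposed by the teacher.
Formalize a curriculum as a sequence of intervention-induced CMDPs, and define a curriculum policy that adapts to statistics of the student's performance.
Treat the teacher as an online learner that optimizes the curriculum policy over multiple rounds, using evaluation features and GP-UCB for parameter optimization.
Describe practical implementation choices, including a CMDP solver based on primal-dual optimization, knowledge transfer across interventions, and a reactive teacher policy tuned in a Bayesian optimization loop.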
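
The first steps above can be illustrated with a minimal sketch, assuming a toy grid world. It shows an intervention as a trigger set plus a state-conditioned reset distribution, and a simple threshold rule for moving to the next intervention. All names here (`Intervention`, `move`, `step_with_intervention`, `next_intervention_index`) are hypothetical and are not taken from the paper's code. The threshold rule is an assumed stand-in for a reactive teacher policy, and the GP-UCB tuning of its parameters is not shown.

```python
import random
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence, Tuple

State = Tuple[int, int]  # (row, col) on a toy grid; purely illustrative


@dataclass(frozen=True)
class Intervention:
    """A trigger set paired with a state-conditioned reset distribution."""
    trigger_states: FrozenSet[State]
    # Samples a safe state given the state that triggered the intervention.
    sample_reset: Callable[[State, random.Random], State]

    def apply(self, state: State, rng: random.Random) -> Tuple[State, bool]:
        """Return (next state, intervened?) after checking the trigger set."""
        if state in self.trigger_states:
            return self.sample_reset(state, rng), True
        return state, False


def move(state: State, action: str) -> State:
    """Toy deterministic dynamics (an assumption, not the paper's environment)."""
    deltas = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    d_row, d_col = deltas[action]
    return (state[0] + d_row, state[1] + d_col)


def step_with_intervention(
    state: State, action: str, intervention: Intervention, rng: random.Random
) -> Tuple[State, bool]:
    """One step of the intervention-modified dynamics.

    The student's action is applied first; if the resulting state is in the
    trigger set, the teacher overrides the outcome with a sampled reset.
    """
    return intervention.apply(move(state, action), rng)


def next_intervention_index(
    current: int, mean_return: float, thresholds: Sequence[float]
) -> int:
    """Assumed threshold rule: advance once the student's mean return
    reaches the threshold attached to the current intervention."""
    if current < len(thresholds) and mean_return >= thresholds[current]:
        return current + 1
    return current


if __name__ == "__main__":
    rng = random.Random(0)
    hole = (0, 1)  # a dangerous cell in the toy grid
    back_to_start = Intervention(frozenset({hole}), lambda s, r: (0, 0))
    print(step_with_intervention((0, 0), "right", back_to_start, rng))
    print(step_with_intervention((0, 0), "down", back_to_start, rng))
```

In this sketch the student never ends a step in the dangerous cell: stepping right from (0, 0) triggers the reset, while stepping down proceeds normally. A teacher that optimizes the curriculum would search over the `thresholds` values across successive students.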

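Experimental Results

Research Questions

RQ1: How can a teacher keep an RL agent safe during learning without full knowledge of the task or the environment?
RQ2: Can a data-driven, adaptive curriculum policy improve safe learning more than fixed-curriculum or no-curriculum settings?
RQ3: Do intervention-induced CMDPs enable safe learning that transfers across students and tasks?
RQ4: How does curriculum design affect final policy performance under safety constraints?
RQ5: How can the teacher optimize the curriculum efficiently in an online setting with limited supervision?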

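Key Findings

A curriculum of safety interventions keeps the student safe during learning by resetting it to a safe state when danger is detected.
Under certain conditions, learning in an intervention-induced CMDP yields a policy that remains feasible in the original CMDP once the teacher is removed.
A data-driven online teacher improves the curriculum policy across generations of students by observing progress statistics.
Empirical results on Frozen Lake and Lunar Lander show that curriculum-optimized CISR achieves final reward comparable to or better than the no-curriculum and fixed-intervention baselines while maintaining safety.
Curriculum policies learned by CISR transfer well across agents with different architectures and sensing capabilities.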


This review was generated by AI and reviewed by human editors.