QUICK REVIEW

[Paper Review] Trial without Error: Towards Safe Reinforcement Learning via Human Intervention

William S. Saunders, Girish Sastry|arXiv (Cornell University)|Jul 17, 2017

Reinforcement Learning in Robotics | References: 17 | Cited by: 110

One-Sentence Summary

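This paper formalizes human-intervention reinforcement learning (HIRL): during training, a human overseer blocks catastrophic actions, a supervised Blocker is trained to imitate those decisions and then takes over, and the scheme's scalability is evaluated on Atari games. No catastrophes occurred in Pong or Space Invaders, the scheme was only partially successful in Road Runner, and the authors discuss the challenges of scaling it up.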

ABSTRACT

AI systems are increasingly applied to complex tasks that involve interaction with humans. During training, such systems are potentially dangerous, as they haven't yet learned to avoid actions that could cause serious harm. How can an AI system explore and learn without making a single mistake that harms humans or otherwise causes serious damage? For model-free reinforcement learning, having a human "in the loop" and ready to intervene is currently the only way to prevent all catastrophes. We formalize human intervention for RL and show how to reduce the human labor required by training a supervised learner to imitate the human's intervention decisions. We evaluate this scheme on Atari games, with a Deep RL agent being overseen by a human for four hours. When the class of catastrophes is simple, we are able to prevent all catastrophes without affecting the agent's learning (whereas an RL baseline fails due to catastrophic forgetting). However, this scheme is less successful when catastrophes are more complex: it reduces but does not eliminate catastrophes and the supervised learner fails on adversarial examples found by the agent. Extrapolating to more challenging environments, we show that our implementation would not scale (due to the infeasible amount of human labor required). We outline extensions of the scheme that are necessary if we are to train model-free agents without a single catastrophe.

Motivation and Objectives

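Define a formal safety framework for model-free reinforcement learning in which human oversight prevents catastrophes during training.
Propose HIRL: a human-in-the-loop scheme in which a Blocker learns to imitate the human's blocking decisions and replaces unsafe actions.
Evaluate HIRL on Atari games, measuring safety and learning efficiency across different agents.
Highlight the scalability challenges and outline strategies for preserving zero-catastrophe safety, where possible, while minimizing human labor.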

Proposed Method

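Model reinforcement learning as a Markov decision process (MDP) and add a human oversight phase in which a human blocks catastrophic actions and replaces them with safe ones.
Collect state-action data labeled with whether the human blocked the action, and use it to train a Blocker classifier that imitates the blocking decisions.
Once the Blocker performs well enough on a held-out set, retire the human and let the Blocker take over oversight; the Blocker also handles action replacement.
Use a CNN-based Blocker trained on raw Atari frames to achieve a low false-negative rate on catastrophes.
Compare HIRL with a reward-shaping baseline that only penalizes catastrophes and does not block actions.
Analyze robustness to distribution shift and adversarial examples, and discuss data efficiency and the cost in human time.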
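
The two oversight phases described above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the paper's implementation: the 10-cell toy environment, the random agent, the rule-based "human", the `TableBlocker` lookup table (standing in for the CNN Blocker), and every function name are hypothetical.

```python
import random

# Toy stand-in environment (an assumption, not an Atari game): the agent
# occupies cells 0..9, and moving left from cell 0 counts as a catastrophe.
LEFT, STAY, RIGHT = -1, 0, 1
ACTIONS = (LEFT, STAY, RIGHT)


def is_catastrophic(state, action):
    return state == 0 and action == LEFT


def human_should_block(state, action):
    # Plays the role of the human overseer, who blocks catastrophic actions.
    return is_catastrophic(state, action)


def safe_action(state):
    # Replacement action executed whenever a proposed action is blocked.
    return STAY


def overseen_step(state, proposed, should_block):
    """Apply oversight to one proposed action; return (executed, blocked)."""
    blocked = should_block(state, proposed)
    executed = safe_action(state) if blocked else proposed
    return executed, blocked


def transition(state, action):
    return min(9, max(0, state + action))


class TableBlocker:
    """Lookup-table imitator of the overseer (hypothetical; the paper uses a CNN)."""

    def __init__(self):
        self.blocked_pairs = set()

    def fit(self, dataset):
        # dataset holds (state, proposed_action, human_blocked) triples.
        for state, action, label in dataset:
            if label:
                self.blocked_pairs.add((state, action))

    def __call__(self, state, action):
        return (state, action) in self.blocked_pairs


def run_phase(should_block, steps, rng, log=None):
    """Run a random agent under oversight; return the number of catastrophes."""
    state, catastrophes = 5, 0
    for _ in range(steps):
        proposed = rng.choice(ACTIONS)
        executed, blocked = overseen_step(state, proposed, should_block)
        if log is not None:
            log.append((state, proposed, blocked))
        if is_catastrophic(state, executed):
            catastrophes += 1
        state = transition(state, executed)
    return catastrophes


rng = random.Random(0)
dataset = []
# Phase 1: the human oversees the agent and the labeled data is recorded.
human_phase = run_phase(human_should_block, 5000, rng, log=dataset)
# Phase 2: the Blocker is trained on that data and replaces the human.
blocker = TableBlocker()
blocker.fit(dataset)
blocker_phase = run_phase(blocker, 5000, rng)
```

A lookup table can only block state-action pairs the human was seen blocking, so the second phase is only as safe as the coverage of the first. That is a toy version of the distribution-shift concern raised in the last item above.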

Experimental Results

Research Questions

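RQ1: With human intervention, can all catastrophic actions be prevented during RL training, for both simple and complex classes of catastrophe?
RQ2: How well does the learned Blocker imitate human intervention, and how well does it scale across different RL agents and environments?
RQ3: What are the human time costs and scalability limits when HIRL is applied to more complex tasks?
RQ4: Which extensions are needed to reduce human labor while still learning with zero catastrophes?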

Key Findings

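HIRL achieved zero catastrophes in Pong and Space Invaders; in Road Runner it cut catastrophes to about 1/50 of the original rate but did not eliminate them.
The Blocker transfers across agents and architectures; in Pong it blocks catastrophic actions without hindering learning.
Reward shaping with a large negative penalty failed to prevent all catastrophes, owing to catastrophic forgetting and adversarial exploitation.
Extrapolation indicates that the current HIRL setup is infeasible for longer or more complex tasks because of the high cost in human time.
The Blocker's robustness can be undermined by an adversarial agent; better data efficiency and active-learning strategies are needed.
In Pong, catastrophes can be avoided locally, but non-local catastrophes expose the limits of blocking alone.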
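
The human-time finding above comes down to simple arithmetic, sketched here. The function `oversight_hours`, its parameters, and the input numbers are hypothetical and are not figures from the paper; the only point is that labor grows linearly with the number of frames a human must watch.

```python
def oversight_hours(frames_to_label: int, labels_per_second: float) -> float:
    """Hours of human time needed to watch and label `frames_to_label` frames."""
    if frames_to_label < 0:
        raise ValueError("frames_to_label must be non-negative")
    if labels_per_second <= 0:
        raise ValueError("labels_per_second must be positive")
    return frames_to_label / labels_per_second / 3600.0


# Illustrative inputs only (assumptions, not measurements from the paper).
short_task = oversight_hours(36_000, 5.0)
# A task needing 1000x more overseen frames costs 1000x more human time.
long_task = oversight_hours(36_000_000, 5.0)
```

Under this linear model, the only ways to cut human time are to label fewer frames or to label them faster, which is why the findings point to data efficiency and active learning.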

This review was generated by AI and reviewed by human editors.