QUICK REVIEW

[Paper Review] End-to-End Robotic Reinforcement Learning without Reward Engineering

Avi Singh, Larry Yang|arXiv (Cornell University)|Apr 16, 2019

Reinforcement Learning in Robotics|Cited by 32

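One-Sentence Summary

This paper proposes RAQ and VICE-RAQ, which learn robotic skills from pixel observations using active binary queries and off-policy classifier-based rewards, with no manually designed reward function.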

ABSTRACT

The combination of deep neural network models and reinforcement learning algorithms can make it possible to learn policies for robotic behaviors that directly read in raw sensory inputs, such as camera images, effectively subsuming both estimation and control into one model. However, real-world applications of reinforcement learning must specify the goal of the task by means of a manually programmed reward function, which in practice requires either designing the very same perception pipeline that end-to-end reinforcement learning promises to avoid, or else instrumenting the environment with additional sensors to determine if the task has been performed successfully. In this paper, we propose an approach for removing the need for manual engineering of reward specifications by enabling a robot to learn from a modest number of examples of successful outcomes, followed by actively solicited queries, where the robot shows the user a state and asks for a label to determine whether that state represents successful completion of the task. While requesting labels for every single state would amount to asking the user to manually provide the reward signal, our method requires labels for only a tiny fraction of the states seen during training, making it an efficient and practical approach for learning skills without manually engineered rewards. We evaluate our method on real-world robotic manipulation tasks where the observations consist of images viewed by the robot's camera. In our experiments, our method effectively learns to arrange objects, place books, and drape cloth, directly from images and without any manually specified reward functions, and with only 1-4 hours of interaction with the real world.

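Motivation and Objectives

Enable end-to-end reinforcement learning from pixel observations on real robots without relying on hand-designed rewards.
Define the reward using a small number of examples of successful outcomes plus active binary queries.
Reduce the data and labeling burden to a level that is practical for real-world robotic applications.
Mitigate exploitation of the classifier in the reward model while keeping learning efficient.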

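Proposed Method

Train a goal classifier on high-dimensional observations and use its log-probability as the reward.
Use active queries to label high-probability states, collecting a small number of binary success labels.
Run soft actor-critic (SAC) with the classifier reward within the maximum-entropy RL framework.
Extend VICE to off-policy learning so that replay-buffer data can be reused for efficiency.
Integrate active queries with VICE to form VICE-RAQ for image-based manipulation tasks.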
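
The first two steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `log_prob_reward`, `select_queries`, the clipping constant `eps`, and the toy probabilities are all hypothetical, and the classifier is assumed to output a success probability for each visited state. Picking the k highest-scoring states is one plain reading of "labeling high-probability states".

```python
import math

def log_prob_reward(p_success, eps=1e-6):
    """Reward = log of the classifier's success probability.

    The probability is clipped to [eps, 1 - eps] so the log stays finite
    (the clipping is an assumption of this sketch).
    """
    p = min(max(p_success, eps), 1.0 - eps)
    return math.log(p)

def select_queries(probs, k):
    """Return the indices of the k states the classifier rates most likely successful."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]

# Hypothetical classifier outputs for five visited states.
probs = [0.05, 0.92, 0.40, 0.99, 0.10]

# Reward signal that would be handed to the RL algorithm (SAC in the summary).
rewards = [log_prob_reward(p) for p in probs]

# States that would be shown to the user for a yes/no success label.
query_idx = select_queries(probs, k=2)
```

With these toy values the two states queried are indices 3 and 1, the ones the classifier is most confident about. The user's answers would then be added to the classifier's training set as labeled examples.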

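Experimental Results

Research Questions

RQ1: Can robotic skills be learned end-to-end from images without hand-designed rewards?
RQ2: Are a small number of positive examples plus active binary queries sufficient to learn an effective reward?
RQ3: Does off-policy VICE combined with active queries improve data efficiency and real-world applicability?
RQ4: How do RAQ and VICE-RAQ perform on image-based manipulation tasks in simulation and in the real world?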

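Key Findings

RAQ and VICE-RAQ learn effectively from pixel observations without manually designed rewards.
In simulation, VICE-RAQ outperforms the other methods on tasks including Visual Pusher, Visual Door Opening, and Visual Picker.
Real-world experiments show that cloth draping, book placement, and placing a cup on a coaster are learned within 1-4 hours of interaction.
Active binary queries (25–75 per run) require far fewer labels than labeling every state.
Off-policy VICE-RAQ makes efficient use of replay-buffer data while mitigating exploitation of the classifier.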
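
To make the replay-buffer point concrete, here is a hedged sketch of how a classifier training batch might be assembled in an off-policy setting. It is not taken from the paper: `sample_classifier_batch`, the string placeholders standing in for images, and the 50/50 split are assumptions for illustration. Treating every replay-buffer state as a negative is a simplification made here.

```python
import random

def sample_classifier_batch(positives, replay_buffer, batch_size, rng):
    """Build a balanced batch for the success classifier.

    Half the batch is drawn from known success examples (label 1) and half
    from states stored in the replay buffer (label 0). Sampling is with
    replacement, so a small positive set can still fill its half.
    """
    half = batch_size // 2
    pos = [(x, 1) for x in rng.choices(positives, k=half)]
    neg = [(x, 0) for x in rng.choices(replay_buffer, k=half)]
    batch = pos + neg
    rng.shuffle(batch)
    return batch

rng = random.Random(0)
positives = ["goal_img_0", "goal_img_1"]           # hypothetical success images
replay_buffer = [f"obs_{t}" for t in range(100)]    # hypothetical stored observations
batch = sample_classifier_batch(positives, replay_buffer, batch_size=8, rng=rng)
```

Because the negatives come from stored experience rather than only the latest rollouts, each classifier update can reuse old data, which is the efficiency argument the summary attributes to the off-policy variant.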

This review was generated by AI and reviewed by a human editor.