QUICK REVIEW

[Paper Review] The Ingredients of Real-World Robotic Reinforcement Learning

Henry Zhu, Justin Yu|arXiv (Cornell University)|Apr 27, 2020

Robot Manipulation and Learning|References: 39|Cited by: 28

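ONE-SENTENCE SUMMARY

This paper presents R3L, a real-world robotic reinforcement learning system that learns dexterous manipulation skills from raw visual observations without hand-designed reward functions, reset mechanisms, or environment instrumentation. By combining unsupervised representation learning with a random perturbation controller, the system learns autonomously and continuously on a real three-fingered robotic hand, acquiring tasks such as valve rotation and bead manipulation from a variety of initial states with no human intervention during training.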

ABSTRACT

The success of reinforcement learning for real world robotics has been, in many cases, limited to instrumented laboratory scenarios, often requiring arduous human effort and oversight to enable continuous learning. In this work, we discuss the elements that are needed for a robotic learning system that can continually and autonomously improve with data collected in the real world. We propose a particular instantiation of such a system, using dexterous manipulation as our case study. Subsequently, we investigate a number of challenges that come up when learning without instrumentation. In such settings, learning must be feasible without manually designed resets, using only on-board perception, and without hand-engineered reward functions. We propose simple and scalable solutions to these challenges, and then demonstrate the efficacy of our proposed system on a set of dexterous robotic manipulation tasks, providing an in-depth analysis of the challenges associated with this learning paradigm. We demonstrate that our complete system can learn without any human intervention, acquiring a variety of vision-based skills with a real-world three-fingered hand. Results and videos can be found at https://sites.google.com/view/realworld-rl/

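RESEARCH MOTIVATION AND GOALS

Enable continual, autonomous robotic reinforcement learning in real-world environments without human intervention.
Eliminate the dependence on hand-designed reward functions, manual resets, and environment instrumentation.
Develop a scalable system that learns from raw sensory input and self-supervised reward signals.
Address the challenges of exploration and policy learning in non-episodic, real-world settings.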

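PROPOSED METHOD

Use unsupervised representation learning (a VAE) to extract meaningful state representations from raw RGB images.
Use a random perturbation controller in place of resets, so that exploration continues without returning to a preset state.
Use VICE (Variational Inverse Control with Events) to learn a reward function from easily collected goal images, with no reward engineering.
Train the policy with SAC (Soft Actor-Critic) on the self-supervised reward and raw observations, enabling end-to-end learning.
Introduce a goal-conditioned policy that generalizes to a variety of initial configurations without episodic resets.
Deploy the system on a real-world D'Claw robotic hand equipped with only an RGB camera.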
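
The way the components listed above fit together can be pictured with a minimal reset-free training loop. The sketch below is an illustration under stated assumptions, not the authors' implementation: the ring-shaped toy environment, the tabular Q-learner (standing in for SAC), the hard-coded goal reward (standing in for a learned VICE classifier), the count-based novelty bonus used to drive the perturbation controller, and the strict alternation schedule are all hypothetical choices made to keep the example small and dependency-free. Image observations and the VAE are omitted entirely.

```python
import random
from collections import defaultdict

N_STATES = 12        # toy assumption: valve angle discretized on a ring
ACTIONS = (-1, +1)   # rotate one step in either direction
GOAL = 6             # hypothetical goal angle index


def step(state, action):
    """Toy dynamics: move around the ring. There is no reset function."""
    return (state + action) % N_STATES


def task_reward(state):
    """Stand-in for a learned classifier reward (assumption, not VICE)."""
    return 1.0 if state == GOAL else 0.0


class QAgent:
    """Tiny tabular Q-learner, used here in place of SAC."""

    def __init__(self, rng, alpha=0.5, gamma=0.9, eps=0.2):
        self.q = defaultdict(float)
        self.rng, self.alpha, self.gamma, self.eps = rng, alpha, gamma, eps

    def act(self, s, greedy=False):
        # Epsilon-greedy during training, purely greedy at evaluation.
        if not greedy and self.rng.random() < self.eps:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(s, a)])

    def update(self, s, a, r, s2):
        best_next = max(self.q[(s2, b)] for b in ACTIONS)
        target = r + self.gamma * best_next
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])


def train_reset_free(num_phases=200, phase_len=20, seed=0):
    """Alternate perturbation and task phases on one unbroken trajectory."""
    rng = random.Random(seed)
    task_agent, perturb_agent = QAgent(rng), QAgent(rng)
    visits = defaultdict(int)
    state = 0  # set once at the start; never reset afterwards
    for phase in range(num_phases):
        perturbing = phase % 2 == 0
        agent = perturb_agent if perturbing else task_agent
        for _ in range(phase_len):
            action = agent.act(state)
            nxt = step(state, action)
            visits[nxt] += 1
            if perturbing:
                # Novelty bonus: rarely visited states pay more.
                reward = 1.0 / visits[nxt] ** 0.5
            else:
                reward = task_reward(nxt)
            agent.update(state, action, reward, nxt)
            state = nxt  # the next phase starts wherever this one ended
    return task_agent, visits


def evaluate(agent, max_steps=N_STATES):
    """Greedy rollouts from every start state; fraction that reach GOAL."""
    successes = 0
    for start in range(N_STATES):
        s = start
        for _ in range(max_steps):
            if s == GOAL:
                break
            s = step(s, agent.act(s, greedy=True))
        successes += s == GOAL
    return successes / N_STATES
```

Calling `train_reset_free()` and then `evaluate(task_agent)` reports the fraction of start states from which the greedy task policy reaches the goal. The point of the design is that each task phase begins wherever the perturbation phase left the system, so the task policy sees many different starting configurations without anyone resetting the environment.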

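EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How can a robotic system learn complex manipulation skills in the real world without any hand-designed reward function or environment instrumentation?
RQ2: In non-episodic, continual real-world training without manual resets, what mechanisms enable effective exploration and policy learning?
RQ3: Can unsupervised representation learning from raw pixels support robust policy training for dexterous manipulation tasks?
RQ4: How does a random perturbation controller compare with fixed or goal-based reset strategies in sample efficiency and robustness?
RQ5: To what extent can the system learn from self-supervision and raw sensory input alone, without ground-truth state or reward signals?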

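KEY FINDINGS

R3L successfully learned two dexterous manipulation tasks, valve rotation and bead manipulation, on a real robotic hand, with no human intervention during training.
Policies trained with the perturbation controller succeed from nearly all initial configurations, outperforming the VICE baseline, which fails from most starting states.
On the valve rotation task, the policy converged after 17 hours of real-world training, demonstrating scalability to complex tasks.
On the bead manipulation task, the system learned a functional policy after 5 hours of training, and evaluation rollouts succeeded consistently across a variety of initial states.
The approach is robust to shifts in the initial state distribution: policies generalize well even when evaluated from arbitrary starting positions.
Ablations confirm that both unsupervised representation learning and the perturbation controller are essential: removing either module causes a significant drop in success rate.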


This review was generated by AI and reviewed by human editors.