QUICK REVIEW

[Paper Review] QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

Dmitry Kalashnikov, Alex Irpan|arXiv (Cornell University)|Jun 27, 2018

Robot Manipulation and Learning|References: 39|Cited by: 574

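One-Sentence Summary

QT-Opt, trained on off-policy data and then modestly fine-tuned on-policy, reaches a 96% grasp success rate on unseen objects in vision-based closed-loop robotic grasping.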

ABSTRACT

In this paper, we study the problem of learning vision-based dynamic manipulation skills using a scalable reinforcement learning approach. We study this problem in the context of grasping, a longstanding challenge in robotic manipulation. In contrast to static learning behaviors that choose a grasp point and then execute the desired grasp, our method enables closed-loop vision-based control, whereby the robot continuously updates its grasp strategy based on the most recent observations to optimize long-horizon grasp success. To that end, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network Q-function with over 1.2M parameters to perform closed-loop, real-world grasping that generalizes to 96% grasp success on unseen objects. Aside from attaining a very high success rate, our method exhibits behaviors that are quite distinct from more standard grasping systems: using only RGB vision-based perception from an over-the-shoulder camera, our method automatically learns regrasping strategies, probes objects to find the most effective grasps, learns to reposition objects and perform other non-prehensile pre-grasp manipulations, and responds dynamically to disturbances and perturbations.

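Research Motivation and Goals

Learn vision-based closed-loop grasping through scalable off-policy reinforcement learning.
Generalize the grasping policy to previously unseen objects.
Demonstrate long-horizon grasping behaviors, including pre-grasp manipulation and regrasping.
Present a scalable distributed training architecture for large-scale RL datasets.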

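Proposed Method

Introduce QT-Opt, a continuous-action Q-learning framework with no explicit actor network.
Train a Q-function Q_theta(s,a) with a cross-entropy Bellman error, using two target networks for stability.
Use stochastic optimization (the cross-entropy method, CEM) to maximize the non-convex Q-function for action selection.
Train on large-scale off-policy data collected by multiple robots (580k grasps), plus on-policy fine-tuning (about 28k grasps).
Implement a distributed asynchronous training pipeline with replay buffers and Bellman update jobs.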
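
The action-selection and target-computation steps listed above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names (cem_argmax, bellman_target, cross_entropy_loss), the sample counts, the discount factor, and the use of plain Python callables in place of the image-conditioned deep Q-network are all illustrative choices. It assumes Q-values lie in [0, 1], which is what makes a cross-entropy loss applicable.

```python
import numpy as np

def cem_argmax(q_fn, state, action_dim, n_samples=64, n_elite=6, n_iters=2, rng=None):
    """Approximate argmax_a q_fn(state, a) with the cross-entropy method (CEM)."""
    rng = np.random.default_rng() if rng is None else rng
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    best_action, best_q = None, -np.inf
    for _ in range(n_iters):
        # Sample candidate actions from the current Gaussian.
        actions = rng.normal(mean, std, size=(n_samples, action_dim))
        q_values = np.array([q_fn(state, a) for a in actions])
        # Keep the highest-scoring candidates and refit the Gaussian to them.
        elite_idx = np.argsort(q_values)[-n_elite:]
        elite = actions[elite_idx]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
        # Track the best candidate seen in any iteration.
        if q_values[elite_idx[-1]] > best_q:
            best_q = float(q_values[elite_idx[-1]])
            best_action = actions[elite_idx[-1]]
    return best_action, best_q

def bellman_target(reward, done, next_state, q_target_1, q_target_2,
                   action_dim, gamma=0.9, rng=None):
    """Target value: pick the next action with CEM on the first target network,
    then take the minimum of the two target networks at that action."""
    if done:
        return reward
    a_next, _ = cem_argmax(q_target_1, next_state, action_dim, rng=rng)
    v_next = min(q_target_1(next_state, a_next), q_target_2(next_state, a_next))
    return reward + gamma * v_next

def cross_entropy_loss(q_pred, target, eps=1e-7):
    """Cross-entropy Bellman error between a predicted Q-value and its target."""
    q = np.clip(q_pred, eps, 1.0 - eps)
    return -(target * np.log(q) + (1.0 - target) * np.log(1.0 - q))
```

Because the policy is just the CEM maximizer of the Q-function, no separate actor network has to be trained; the same routine serves both action selection on the robot and the maximization inside the Bellman target.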

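Experimental Results

Research Questions

RQ1: Can off-policy deep Q-learning from visual input generalize well in dynamic grasping tasks?
RQ2: Can long-horizon reinforcement learning produce pre-grasp manipulation and regrasping in cluttered scenes with unseen objects?
RQ3: How do the scale of the off-policy data and on-policy fine-tuning affect grasping performance?
RQ4: How does QT-Opt compare with prior self-supervised grasping methods that do not optimize long-horizon success?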

Key Findings

Method	Dataset	Test	Bin emptying (first 10)	Bin emptying (first 20)	Bin emptying (first 30)
QT-Opt (ours)	580k off-policy + 28k on-policy	96%	88%	88%	76%
Levine et al. [27]	900k grasps from Levine et al. [27]	78%	76%	72%	72%

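With off-policy data plus modest on-policy fine-tuning, QT-Opt reaches a 96% grasp success rate on unseen objects.
Off-policy training alone already outperforms the prior self-supervised grasping baseline.
On-policy fine-tuning (about 28k grasps) yields a measurable improvement by enabling hard negative mining and long-horizon optimization.
The policy exhibits advanced behaviors such as pre-grasp manipulation, regrasping, and responding to dynamic disturbances.
A large-scale distributed reinforcement learning setup made it possible to collect 580k grasps across 7 robots.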

This review was generated by AI and reviewed by human editors.