QUICK REVIEW

[Paper Review] DDCO: Discovery of Deep Continuous Options for Robot Learning from Demonstrations

Sanjay Krishnan, Roy Fox|arXiv (Cornell University)|Oct 15, 2017

Reinforcement Learning in Robotics | Cited by 50

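One-Sentence Summary

tldr: DDCO extends Deep Discovery of Options (DDO) to continuous control by introducing a hybrid discrete-continuous high-level policy and using cross-validation to select the number of options; it learns deep continuous options from demonstrations for robot imitation learning, achieving higher sample efficiency and success rates on simulated and physical robot tasks.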

ABSTRACT

An option is a short-term skill consisting of a control policy for a specified region of the state space, and a termination condition recognizing leaving that region. In prior work, we proposed an algorithm called Deep Discovery of Options (DDO) to discover options to accelerate reinforcement learning in Atari games. This paper studies an extension to robot imitation learning, called Discovery of Deep Continuous Options (DDCO), where low-level continuous control skills parametrized by deep neural networks are learned from demonstrations. We extend DDO with: (1) a hybrid categorical-continuous distribution model to parametrize high-level policies that can invoke discrete options as well as continuous control actions, and (2) a cross-validation method that relaxes DDO's requirement that users specify the number of options to be discovered. We evaluate DDCO in simulation of a 3-link robot in the vertical plane pushing a block with friction and gravity, and in two physical experiments on the da Vinci surgical robot: needle insertion, where a needle is grasped and inserted into a silicone tissue phantom, and needle bin picking, where needles and pins are grasped from a pile and categorized into bins. In the 3-link arm simulation, results suggest that DDCO can take 3x fewer demonstrations to achieve the same reward compared to a baseline imitation learning approach. In the needle insertion task, DDCO was successful 8/10 times compared to the next most accurate imitation learning baseline 6/10. In the surgical bin picking task, the learned policy successfully grasps a single object in 66 out of 99 attempted grasps, and in all but one case successfully recovered from failed grasps by retrying a second time.
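The definition of an option in the abstract (a control policy plus a termination condition) can be illustrated with a minimal sketch. Everything here is a toy assumption for illustration: the 1-D state, the names `Option` and `run_option`, and the `step` dynamics function are hypothetical and do not come from the paper, where the policies are deep networks over high-dimensional observations.

```python
from dataclasses import dataclass
from typing import Callable, List

State = float  # toy 1-D state, an assumption for illustration only


@dataclass
class Option:
    policy: Callable[[State], float]     # maps a state to a continuous action
    terminates: Callable[[State], bool]  # True once the option should stop


def run_option(
    option: Option,
    state: State,
    step: Callable[[State, float], State],
    max_steps: int = 100,
) -> List[State]:
    """Run one option until its termination condition fires or max_steps is hit."""
    trajectory = [state]
    for _ in range(max_steps):
        if option.terminates(state):
            break
        state = step(state, option.policy(state))
        trajectory.append(state)
    return trajectory


# Toy option: move halfway toward the target 1.0, stop once within 0.1 of it.
reach = Option(
    policy=lambda s: 0.5 * (1.0 - s),
    terminates=lambda s: abs(1.0 - s) < 0.1,
)
trajectory = run_option(reach, 0.0, step=lambda s, a: s + a)
```

A high-level policy would call something like `run_option` repeatedly, handing control to a different option each time the current one terminates.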

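Motivation and Goals

Learn reusable, hierarchical sub-skills (options) that map high-dimensional observations to continuous actions in robot tasks.
Develop a hybrid categorical-continuous high-level policy that chooses between invoking a discrete option and executing a direct control action.
Introduce an offline cross-validation method that selects the number of options to discover automatically, without manual tuning.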

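Proposed Method

Extends the DDO framework to continuous control by modeling a hybrid output: discrete options and continuous actions.
Maximizes the likelihood of the demonstration trajectories, with latent options and termination conditions, using an expectation-gradient method.
Represents the high-level policy with a hybrid distribution in which the high level can choose between options and direct control, and derives the corresponding gradients.
Applies a cross-validation scheme over folds to select the number of options that generalizes best.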
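One way to picture the hybrid high-level policy is as a categorical choice over k options plus one extra "act directly" outcome, where the direct outcome also carries a continuous action density. The sketch below computes the log-probability of a single high-level decision under that reading. It is an assumption-laden illustration, not the paper's parametrization: the softmax over k+1 logits, the isotropic Gaussian with a shared standard deviation, and the function names are all hypothetical.

```python
import math
from typing import List, Optional


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def gaussian_log_pdf(action: List[float], mean: List[float], std: float) -> float:
    """Log-density of an isotropic Gaussian with a shared standard deviation."""
    d = len(action)
    sq_dist = sum((a - m) ** 2 for a, m in zip(action, mean))
    return -0.5 * sq_dist / std**2 - d * math.log(std) - 0.5 * d * math.log(2 * math.pi)


def hybrid_log_prob(
    logits: List[float],
    mean: List[float],
    std: float,
    choice: int,
    action: Optional[List[float]] = None,
) -> float:
    """Log-probability of one high-level decision (assumed model).

    logits: k option scores followed by one score for direct control.
    choice: index in [0, k]; index k means "emit a continuous action".
    """
    log_p = log_softmax(logits)
    k = len(logits) - 1
    if not 0 <= choice <= k:
        raise ValueError("choice out of range")
    if choice < k:
        return log_p[choice]  # invoke discrete option `choice`
    if action is None:
        raise ValueError("direct control requires a continuous action")
    # Categorical mass of "act directly" times the Gaussian action density.
    return log_p[k] + gaussian_log_pdf(action, mean, std)
```

Summing this quantity over the high-level decisions in a demonstration gives a log-likelihood term that a gradient-based method could maximize; in the actual method the options are latent, so the expectation-gradient step would weight such terms by posterior probabilities rather than use observed choices.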

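Experimental Results

Research Questions

RQ1: Can DDCO learn deep continuous options for robot tasks from demonstrations?
RQ2: Does the hybrid high-level policy outperform a flat policy in learning efficiency and generalization?
RQ3: Can cross-validation reliably select the number of options without task-specific tuning?
RQ4: How do the learned options perform on simulated and physical robot manipulation tasks?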

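Key Findings

In the 3-link arm simulation, results suggest DDCO can take about 3x fewer demonstrations than a baseline imitation learning approach to reach the same reward.
In the needle insertion task, DDCO's hierarchical policy succeeded in 8/10 trials, ahead of the next most accurate imitation learning baseline at 6/10.
In the surgical bin picking task, the learned policy grasped a single object in 66 of 99 attempted grasps and recovered from failed grasps in all but one case by retrying, outperforming non-hierarchical approaches.
The options learned by DDCO are interpretable: different options specialize in grasping, reorienting, or image-based actions within a task.
The number of options selected by cross-validation correlates with the maximum task reward, which allows the number of options to be chosen offline.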
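Offline selection of the number of options can be sketched as ordinary k-fold cross-validation over demonstrations: fit a model for each candidate option count on the training folds, score it on the held-out fold, and keep the count with the best average score. This is a generic sketch under stated assumptions, not the paper's exact procedure. `fit` and `score` are hypothetical callables supplied by the caller (for example, training a hierarchical policy and evaluating held-out log-likelihood), and the round-robin fold split is an illustrative choice.

```python
from statistics import mean
from typing import Any, Callable, Iterable, Sequence


def select_num_options(
    demos: Sequence[Any],
    candidates: Iterable[int],
    fit: Callable[[list, int], Any],      # hypothetical: fit(train_demos, k) -> model
    score: Callable[[Any, list], float],  # hypothetical: score(model, held_out) -> float
    n_folds: int = 5,
) -> int:
    """Return the candidate option count with the best mean held-out score."""
    if n_folds < 2 or n_folds > len(demos):
        raise ValueError("n_folds must be between 2 and len(demos)")
    # Round-robin assignment of demonstrations to folds.
    folds = [list(demos[i::n_folds]) for i in range(n_folds)]
    best_k, best_score = None, float("-inf")
    for k in candidates:
        fold_scores = []
        for i, held_out in enumerate(folds):
            train = [d for j, fold in enumerate(folds) if j != i for d in fold]
            model = fit(train, k)
            fold_scores.append(score(model, held_out))
        avg = mean(fold_scores)
        if avg > best_score:  # ties keep the earlier candidate
            best_k, best_score = k, avg
    if best_k is None:
        raise ValueError("candidates must not be empty")
    return best_k
```

Because the score is computed only on demonstrations the model did not see, this selection needs no additional robot rollouts, which is what makes it usable offline.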


This review was generated by AI and reviewed by a human editor.