QUICK REVIEW

[Paper Review] DART: Noise Injection for Robust Imitation Learning

Michael Laskey, Jonathan Lee|arXiv (Cornell University)|Mar 27, 2017

Reinforcement Learning in Robotics|Cited by 78

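One-Sentence Summary

DART injects optimized noise into the supervisor's demonstrations to mitigate covariate shift in imitation learning, matching DAgger's performance while being more computationally efficient and safer for human supervisors.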

ABSTRACT

One approach to Imitation Learning is Behavior Cloning, in which a robot observes a supervisor and infers a control policy. A known problem with this "off-policy" approach is that the robot's errors compound when drifting away from the supervisor's demonstrations. On-policy techniques alleviate this by iteratively collecting corrective actions for the current robot policy. However, these techniques can be tedious for human supervisors, add significant computational burden, and may visit dangerous states during training. We propose an off-policy approach that injects noise into the supervisor's policy while demonstrating. This forces the supervisor to demonstrate how to recover from errors. We propose a new algorithm, DART (Disturbances for Augmenting Robot Trajectories), that collects demonstrations with injected noise, and optimizes the noise level to approximate the error of the robot's trained policy during data collection. We compare DART with DAgger and Behavior Cloning in two domains: in simulation with an algorithmic supervisor on the MuJoCo tasks (Walker, Humanoid, Hopper, Half-Cheetah) and in physical experiments with human supervisors training a Toyota HSR robot to perform grasping in clutter. For high-dimensional tasks like Humanoid, DART can be up to $3\times$ faster in computation time and only decreases the supervisor's cumulative reward by $5\%$ during training, whereas DAgger executes policies that have $80\%$ less cumulative reward than the supervisor. On the grasping in clutter task, DART obtains on average a $62\%$ performance increase over Behavior Cloning.

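Motivation and Objectives

Address covariate shift in off-policy imitation learning (Behavior Cloning).
Provide an off-policy method that injects noise so the learner is exposed to demonstrations of how to recover from errors.
Reduce supervisor burden and computational cost relative to on-policy methods such as DAgger.
Demonstrate DART's effectiveness on MuJoCo locomotion tasks and on a real-world grasping-in-clutter task.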

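Proposed Method

Introduce DART (Disturbances for Augmenting Robot Trajectories), which injects noise into the supervisor's policy during demonstrations.
Formulate a noise optimization that aligns the supervisor's noisy demonstrations with the robot's final policy.
Derive an iterative procedure (Algorithm 1) that updates the noise statistics to minimize the negative log-likelihood of the robot's controls under the noisy supervisor.
Provide a theoretical bound showing that covariate shift is reduced, expressed through the KL divergence between trajectory distributions.
Give a closed-form update for the Gaussian noise covariance within the iterative scheme.
Evaluate on MuJoCo locomotion tasks and on grasping in clutter with a Toyota HSR, using both algorithmic and human supervisors.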
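
The noise-injection and covariance-update steps listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes zero-mean Gaussian noise centered on the supervisor's action, in which case minimizing the negative log-likelihood of the robot's controls yields the average outer product of the robot-minus-supervisor action differences. The function names, the toy dynamics, the tanh supervisor, and the linear robot policy are all hypothetical choices made for illustration, and the sketch refits on the latest batch only as a simplification.

```python
import numpy as np


def estimate_noise_covariance(robot_actions, supervisor_actions):
    """Maximum-likelihood covariance of zero-mean Gaussian noise.

    Assumption: the noisy supervisor is N(supervisor_action, cov), so the
    covariance that makes the robot's actions most likely is the mean outer
    product of the action differences.
    """
    diff = np.asarray(robot_actions, dtype=float) - np.asarray(supervisor_actions, dtype=float)
    return diff.T @ diff / diff.shape[0]


def collect_noisy_demo(supervisor, step, x0, cov, horizon, rng):
    """Roll out the supervisor with injected noise.

    The noisy action is executed, but the recorded label is the supervisor's
    intended (noise-free) action, so the data shows how to recover.
    """
    states, labels = [], []
    x = np.asarray(x0, dtype=float)
    for _ in range(horizon):
        u_star = supervisor(x)
        noise = rng.multivariate_normal(np.zeros(len(u_star)), cov)
        states.append(x)
        labels.append(u_star)
        x = step(x, u_star + noise)
    return np.array(states), np.array(labels)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    supervisor = lambda x: -0.5 * np.tanh(x)  # hypothetical nonlinear supervisor
    step = lambda x, u: x + u                 # hypothetical toy dynamics
    cov = np.zeros((2, 2))                    # first pass: no noise (plain Behavior Cloning)
    for k in range(3):
        states, labels = collect_noisy_demo(
            supervisor, step, np.array([1.0, -2.0]), cov, 20, rng
        )
        # Hypothetical robot policy: a linear map fitted by least squares.
        W, *_ = np.linalg.lstsq(states, labels, rcond=None)
        # Re-estimate the noise so it approximates the robot's current error.
        cov = estimate_noise_covariance(states @ W, labels)
        print(k, np.trace(cov))
```

Recording the noise-free action as the label is the design choice that matters here: the injected noise only changes which states are visited, while the supervision signal stays clean.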

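Experimental Results

Research Questions

RQ1: Does DART reduce covariate shift as effectively as on-policy methods?
RQ2: How does DART affect computation time and the supervisor's reward during data collection?
RQ3: Can DART produce better demonstrations under human supervision?
RQ4: How does DART compare with Behavior Cloning and DAgger on high-dimensional robot tasks?

Key Findings

DART matches DAgger on the MuJoCo locomotion domains while requiring substantially less computation time (for example, up to about 3x faster on Humanoid).
During training, DART lowers the supervisor's cumulative reward by only about 5%, whereas DAgger executes policies with about 80% less cumulative reward than the supervisor.
On the human-supervised grasping-in-clutter task, DART with an appropriate noise level achieves on average a 62% performance increase over Behavior Cloning.
Unoptimized isotropic Gaussian noise performs poorly and can yield unsafe policies, underscoring the need for optimized noise.
DART shows significant gains over Behavior Cloning on high-dimensional tasks, and it reduces covariate shift by better matching the trajectory distribution of the robot's final policy.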


This review was generated by AI and reviewed by human editors.