QUICK REVIEW

[Paper Review] Robust Imitation of Diverse Behaviors

Ziyu Wang, Josh Merel|arXiv (Cornell University)|Jul 10, 2017

Reinforcement Learning in Robotics | References: 36 | Cited by: 63

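One-Sentence Summary

This paper combines a variational autoencoder (VAE) based policy with a conditioned, GAN-style imitation objective to achieve robust and diverse one-shot imitation learning across many behaviors on high-dimensional robotic systems.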

ABSTRACT

Deep generative models have recently shown great promise in imitation learning for motor control. Given enough data, even supervised approaches can do one-shot imitation learning; however, they are vulnerable to cascading failures when the agent trajectory diverges from the demonstrations. Compared to purely supervised methods, Generative Adversarial Imitation Learning (GAIL) can learn more robust controllers from fewer demonstrations, but is inherently mode-seeking and more difficult to train. In this paper, we show how to combine the favourable aspects of these two approaches. The base of our model is a new type of variational autoencoder on demonstration trajectories that learns semantic policy embeddings. We show that these embeddings can be learned on a 9 DoF Jaco robot arm in reaching tasks, and then smoothly interpolated with a resulting smooth interpolation of reaching behavior. Leveraging these policy representations, we develop a new version of GAIL that (1) is much more robust than the purely-supervised controller, especially with few demonstrations, and (2) avoids mode collapse, capturing many diverse behaviors when GAIL on its own does not. We demonstrate our approach on learning diverse gaits from demonstration on a 2D biped and a 62 DoF 3D humanoid in the MuJoCo physics environment.

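Motivation and Objectives

Use a VAE to learn a semantic embedding space of demonstration trajectories, so that policies can be smoothly interpolated.
Combine the VAE-based embeddings with a conditioned, GAN-style imitation objective to address brittleness and mode collapse.
Demonstrate robust, diverse behavior imitation from a small number of demonstrations on multi-DoF robots in MuJoCo.
Achieve one-shot imitation by mapping new trajectories into the learned embedding space.
Show that the approach scales to high-dimensional bodies such as the 62 DoF humanoid.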

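Proposed Method

Train a variational autoencoder on demonstration sequences, using a bidirectional LSTM encoder and two decoders (one for actions, one for state dynamics).
Decode actions from (state, embedding) pairs with an MLP; decode the next state autoregressively with a WaveNet-based state model.
Use a stochastic VAE to obtain the latent variable z, minimizing the reconstruction loss plus the KL divergence to the prior p(z).
Extend GAIL by conditioning the discriminator on the VAE embedding z and marginalizing over q(z|x).
Define the reward as r(x,a|z) = -log(1 - Dψ(x,a|z)) and update the policy with TRPO, keeping the VAE prior fixed to stabilize learning.
Initialize the policy near the VAE mean, but train a conditional policy that explores under a Gaussian centered at μθ(x,z) + μα(x,z).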
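The two scalar quantities named above, the VAE objective and the conditioned GAIL reward, can be illustrated with a minimal sketch. It assumes a diagonal-Gaussian posterior q(z|x) parameterized by a mean and log-variance, and a standard normal prior p(z); neither choice is stated in this summary. The function names and the `kl_weight` and `eps` parameters are hypothetical, introduced only for illustration, and the discriminator output is taken as a given probability rather than computed by a network.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.

    Assumes a standard normal prior p(z); this is an assumption of the sketch.
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def vae_loss(reconstruction_nll, mu, log_var, kl_weight=1.0):
    """Reconstruction term plus a weighted KL term.

    reconstruction_nll stands in for the summed action and state decoder losses;
    kl_weight is a hypothetical knob, not a value taken from the paper.
    """
    return reconstruction_nll + kl_weight * kl_to_standard_normal(mu, log_var)


def gail_reward(d_prob, eps=1e-8):
    """r(x, a | z) = -log(1 - D(x, a | z)) for a discriminator output in [0, 1].

    eps clamps the argument of the log so the reward stays finite when D -> 1.
    """
    return -math.log(max(1.0 - d_prob, eps))
```

Under this reward, state-action pairs that the conditioned discriminator rates as more demonstration-like (output closer to 1) earn a larger reward, which is the signal the TRPO policy update would then maximize.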

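Experimental Results

Research Questions

RQ1: Can the VAE-based embedding space capture semantically meaningful, interpolable behavior classes from demonstrations?
RQ2: Does conditioning GAIL on VAE embeddings reduce mode collapse and increase the diversity of the learned behaviors?
RQ3: To what extent can the method learn robust, diverse policies across different bodies (arm, walker, humanoid) from a limited number of demonstrations?
RQ4: Can the encoder map new trajectories into the embedding space to enable effective one-shot imitation?
RQ5: How well does the method scale to high-dimensional control problems such as the 62 DoF humanoid?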

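Key Findings

The VAE learns a structured embedding space that enables smooth policy interpolation between demonstrated trajectories.
Interpolation in latent space also corresponds to interpolation in the task space of the Jaco arm.
A discriminator conditioned on VAE embeddings yields imitation that is more robust and more diverse than pure behavioral cloning (BC) or conventional GAIL.
Adversarial training improves speed matching and stability on the 2D walker, across diverse styles and unseen trajectories.
The method achieves robust imitation on the high-dimensional humanoid, with a lower fall rate than non-adaptive baselines.
Empirical results show that the embedding space clusters by movement speed, with meaningful transitions between behaviors.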
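The latent-space interpolation mentioned in the findings can be sketched as follows. This assumes plain linear interpolation between two trajectory embeddings; the summary says only that the embeddings are smoothly interpolated, so the exact scheme is an assumption. `interpolate_embeddings` is a hypothetical helper, and the embeddings are represented as plain lists of floats rather than encoder outputs.

```python
def interpolate_embeddings(z_start, z_end, num_steps):
    """Return num_steps embeddings evenly spaced on the line from z_start to z_end.

    Each returned embedding would be fed to the conditional policy in place of z.
    Linear interpolation is an assumption of this sketch.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if len(z_start) != len(z_end):
        raise ValueError("embeddings must have the same dimension")
    path = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # t runs from 0.0 to 1.0 inclusive
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Usage: three points between two hypothetical 2-D embeddings.
path = interpolate_embeddings([0.0, 1.0], [2.0, 3.0], 3)
```

Conditioning the policy on each intermediate embedding in turn is one way to check whether the resulting behavior, such as the arm's reaching target, changes as smoothly as the embedding does.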

This review was generated by AI and reviewed by human editors.