QUICK REVIEW

[Paper Review] Neural probabilistic motor primitives for humanoid control

Josh Merel, Leonard Hasenclever|arXiv (Cornell University)|Nov 28, 2018

Motor Control and Adaptation | Cited by 89

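One-Sentence Summary

This paper proposes neural probabilistic motor primitives, an offline-trained motor module that compresses thousands of expert humanoid skills into a latent space, enabling one-shot imitation and reuse by higher-level controllers. It also compares two offline policy-transfer methods: behavioral cloning and linear feedback policy cloning (LFPC).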

ABSTRACT

We focus on the problem of learning a single motor module that can flexibly express a range of behaviors for the control of high-dimensional physically simulated humanoids. To do this, we propose a motor architecture that has the general structure of an inverse model with a latent-variable bottleneck. We show that it is possible to train this model entirely offline to compress thousands of expert policies and learn a motor primitive embedding space. The trained neural probabilistic motor primitive system can perform one-shot imitation of whole-body humanoid behaviors, robustly mimicking unseen trajectories. Additionally, we demonstrate that it is also straightforward to train controllers to reuse the learned motor primitive space to solve tasks, and the resulting movements are relatively naturalistic. To support the training of our model, we compare two approaches for offline policy cloning, including an experience efficient method which we call linear feedback policy cloning. We encourage readers to view a supplementary video ( https://youtu.be/CaDEf-QcKwA ) summarizing our results.

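Motivation and Objectives

Develop a motor primitive module that can represent and generate a large number of humanoid motor skills.
Enable one-shot imitation and flexible reuse of skills within a compact embedding space.
Avoid extensive online reinforcement learning by transferring policies offline from expert demonstrations.
Compare two offline transfer methods: behavioral cloning and linear feedback policy cloning (LFPC).
Demonstrate the robustness, naturalness, and transferability of the learned primitives across tasks and on unseen trajectories.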

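Proposed Method

Propose an autoregressive latent-variable model with a latent variable z_t at each time step, which conditions the action distribution p(a_t|s_t,z_t).
Encode a short look-ahead trajectory snippet x_t to train the encoder q(z_t|z_{t-1},x_t) and the decoder π(a_t|s_t,z_t).
Place an AR(1) prior on z_t to encourage temporal consistency, and compress information through a beta-weighted ELBO objective.
Train with offline supervised learning on expert trajectories (2707 clips), enabling one-shot imitation without online RL.
Introduce two offline transfer schemes: (a) behavioral cloning from noisy expert rollouts, and (b) linear feedback policy cloning (LFPC), which uses the action-state Jacobian to achieve robustness at nearby states.
For LFPC, adapt the objective by adding state perturbations and Jacobian-based corrections to the likelihood and KL terms.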
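
The pieces of the method listed above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the diagonal-Gaussian form of the encoder, the analytic KL term, and the unit-stationary-variance AR(1) parameterization (noise scale sqrt(1 - alpha^2)) are all assumptions made for illustration, and the values of alpha and beta are arbitrary. It shows one AR(1) prior step, a per-step beta-weighted ELBO term, and the action target that a linear feedback policy would assign to a perturbed state.

```python
import math
import random


def ar1_prior_step(z_prev, alpha, rng):
    """Sample z_t ~ N(alpha * z_prev, (1 - alpha^2) I).

    Assumed parameterization: it keeps the stationary variance at 1.
    """
    sigma = math.sqrt(1.0 - alpha ** 2)
    return [alpha * z + sigma * rng.gauss(0.0, 1.0) for z in z_prev]


def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def beta_elbo_step(log_lik, mu_q, std_q, z_prev, alpha, beta):
    """One time step of a beta-weighted ELBO (to be maximized).

    log_lik stands for log pi(a_t | s_t, z_t); the KL is taken between the
    encoder q(z_t | z_{t-1}, x_t) and the AR(1) prior p(z_t | z_{t-1}).
    """
    sigma = math.sqrt(1.0 - alpha ** 2)
    mu_p = [alpha * z for z in z_prev]
    std_p = [sigma] * len(z_prev)
    return log_lik - beta * kl_diag_gaussians(mu_q, std_q, mu_p, std_p)


def lfpc_target(a_nominal, jacobian, delta_s):
    """Action target at a perturbed state: a_nominal + J @ delta_s.

    jacobian[i][j] is the assumed derivative of action i with respect to
    state dimension j, evaluated on the expert's nominal trajectory.
    """
    return [
        a + sum(j * d for j, d in zip(row, delta_s))
        for a, row in zip(a_nominal, jacobian)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    z = [0.0, 0.0]
    for _ in range(3):
        z = ar1_prior_step(z, alpha=0.95, rng=rng)  # alpha is illustrative
    print("latent after 3 prior steps:", z)
    print("LFPC target:", lfpc_target([1.0, 2.0], [[1.0, 0.0], [0.5, 0.5]], [0.2, -0.2]))
```

In a sketch like this, a cloning loss would compare the student's action at the perturbed state s + delta_s against `lfpc_target(...)`, so a single nominal trajectory plus its Jacobians can supervise a neighborhood of states instead of requiring many noisy rollouts.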

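Experimental Results

Research Questions

RQ1: Can a single neural probabilistic motor primitive module compress thousands of expert humanoid skills into a usable embedding space?
RQ2: Can offline-trained primitives achieve one-shot imitation and robust reproduction of unseen trajectories?
RQ3: How do behavioral cloning and LFPC compare in data efficiency and performance for offline transfer?
RQ4: Can higher-level controllers reuse the learned primitives to solve new tasks with natural-looking movement?
RQ5: How does the structure of the latent space affect robustness to perturbations and generalization to unseen behaviors?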

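Key Findings

The motor primitive module can compress thousands of expert policies into a learned embedding space.
Under some regularization settings, LFPC using a single trajectory reaches the one-shot imitation performance of behavioral cloning using hundreds of trajectories.
Regularization and a larger latent space improve imitation performance and robustness.
The learned primitive space lets high-level policies reuse it to produce human-like movement on sparse-reward tasks.
Optimizing the latent sequence can improve one-shot imitation on borderline trajectories, suggesting that the latent representation is meaningful.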


This review was generated by AI and reviewed by human editors.