QUICK REVIEW

[Paper Review] Primal Wasserstein Imitation Learning

Robert Dadashi, Léonard Hussenot|arXiv (Cornell University)|Jun 8, 2020

Reinforcement Learning in Robotics | References: 61 | Cited by: 41

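One-sentence summary

PWIL minimizes the Wasserstein distance between the expert and agent state-action distributions in its primal form, deriving a reward offline that yields near-expert imitation from few demonstrations without min-max training.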

ABSTRACT

Imitation Learning (IL) methods seek to match the behavior of an agent with that of an expert. In the present work, we propose a new IL method based on a conceptually simple algorithm: Primal Wasserstein Imitation Learning (PWIL), which ties to the primal form of the Wasserstein distance between the expert and the agent state-action distributions. We present a reward function which is derived offline, as opposed to recent adversarial IL algorithms that learn a reward function through interactions with the environment, and which requires little fine-tuning. We show that we can recover expert behavior on a variety of continuous control tasks of the MuJoCo domain in a sample efficient manner in terms of agent interactions and of expert interactions with the environment. Finally, we show that the behavior of the agent we train matches the behavior of the expert with the Wasserstein distance, rather than the commonly used proxy of performance.

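Motivation and goals

Motivate imitation learning for settings where a reward signal is sparse or hard to specify.
Propose a principled, distance-based objective: the primal Wasserstein distance between state-action distributions.
Derive an offline reward function from an upper bound on the primal Wasserstein distance to guide learning.
Demonstrate sample-efficient recovery of expert behavior on continuous control tasks, including the challenging Humanoid environment.
Show applicability to both feature-based and pixel-based (visual) observations.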
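
For reference, the primal form named in the goals above can be written out as follows. This is the standard optimal-transport definition for two empirical distributions, not a formula quoted from this summary; the symbols T (number of agent state-action pairs), D (number of expert state-action pairs), \theta (the coupling), and d (the ground metric) are notation assumed here.

```latex
W_1(\hat{\rho}_\pi, \hat{\rho}_e)
  = \min_{\theta \in \Theta} \sum_{i=1}^{T} \sum_{j=1}^{D}
    \theta_{ij} \, d\big((s_i^\pi, a_i^\pi), (s_j^e, a_j^e)\big),
\qquad
\Theta = \Big\{ \theta \in \mathbb{R}_{\ge 0}^{T \times D} :
  \sum_{j=1}^{D} \theta_{ij} = \tfrac{1}{T}, \;
  \sum_{i=1}^{T} \theta_{ij} = \tfrac{1}{D} \Big\}
```

Because the minimum ranges over every feasible coupling, the cost of any single feasible coupling is an upper bound on W_1. That is why a reward built from one fixed coupling rule can still push the true distance down.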

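Proposed method

Formulate imitation as minimizing the 1-Wasserstein distance between the empirical state-action distributions of the agent and the expert.
Introduce a greedy coupling that yields an upper bound on the Wasserstein distance and can be computed online.
Define an episodic, history-dependent reward r as a monotonically decreasing function of the greedy cost c_i, which is computed from the distances to expert state-action pairs.
Present the PWIL algorithm, which pairs the offline-derived reward with a generic reinforcement learning agent.
Define the distance d(.) as the standardized Euclidean distance on the concatenated state-action vector; the metric can alternatively be learned, for example from pixels.
Demonstrate scalability and stability by avoiding the min-max training loop typical of adversarial IL methods.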
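
The steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `GreedyCouplingRewarder`, the exponential reward form `alpha * exp(-beta * horizon * cost)`, the default values of `alpha` and `beta`, and the choice to standardize with expert statistics are all hypothetical. Only the overall recipe (greedy matching of each agent step to the nearest remaining expert mass, then a decreasing function of the resulting cost) comes from the summary.

```python
import math

EPS = 1e-12  # tolerance for treating a weight as exhausted


class GreedyCouplingRewarder:
    """Illustrative greedy-coupling reward (hypothetical names and defaults)."""

    def __init__(self, expert_sa, horizon, alpha=5.0, beta=5.0):
        n, dims = len(expert_sa), range(len(expert_sa[0]))
        # Per-dimension standard deviation of the expert data (assumption:
        # the "standardized" distance uses expert statistics).
        means = [sum(x[d] for x in expert_sa) / n for d in dims]
        stds = [math.sqrt(sum((x[d] - means[d]) ** 2 for x in expert_sa) / n)
                for d in dims]
        self.scale = [s if s > 0 else 1.0 for s in stds]
        self.expert = [self._standardize(x) for x in expert_sa]
        self.horizon = horizon
        self.alpha, self.beta = alpha, beta
        self.reset()

    def _standardize(self, x):
        return [v / s for v, s in zip(x, self.scale)]

    def reset(self):
        # Each expert state-action pair starts an episode with equal mass.
        self.weights = [1.0 / len(self.expert)] * len(self.expert)

    def reward(self, state_action):
        x = self._standardize(state_action)
        agent_mass = 1.0 / self.horizon  # mass carried by one agent step
        cost = 0.0
        while agent_mass > EPS:
            live = [j for j, w in enumerate(self.weights) if w > EPS]
            if not live:
                break
            # Greedy choice: nearest expert pair that still has mass left.
            j = min(live, key=lambda k: math.dist(x, self.expert[k]))
            moved = min(agent_mass, self.weights[j])
            cost += moved * math.dist(x, self.expert[j])
            agent_mass -= moved
            self.weights[j] -= moved  # a pair with no mass left is "popped out"
        # Any decreasing function of the cost would fit the description;
        # the exponential and the horizon scaling are assumptions.
        return self.alpha * math.exp(-self.beta * self.horizon * cost)


rewarder = GreedyCouplingRewarder([[0.0, 0.0], [1.0, 1.0]], horizon=2)
first = rewarder.reward([0.0, 0.0])   # exact match with an expert pair
second = rewarder.reward([0.0, 0.0])  # that pair is used up, so cost rises
```

In a training loop, `reset()` would be called at the start of every episode and `reward()` once per step, with the returned value handed to whichever RL agent is in use. The reward depends on the episode history because each step consumes expert mass that later steps can no longer match.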

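Experimental results

Research questions

RQ1: Can PWIL recover expert behavior on MuJoCo locomotion tasks as the number of demonstrations varies?
RQ2: How sample-efficient is PWIL relative to recent IL methods such as DAC and BC?
RQ3: Does PWIL actually minimize the Wasserstein distance between the expert and agent state-action distributions?
RQ4: Can PWIL extend to pixel-based (visual) observations, where the MDP metric is learned offline?
RQ5: How do the ablations affect PWIL's ability to reproduce expert behavior?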
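
RQ3 distinguishes the true Wasserstein distance from the greedy quantity used to build the reward. The toy sketch below illustrates that gap; it is not the paper's evaluation protocol. It assumes two equally sized sets of points with uniform weights, in which case an optimal coupling is a one-to-one matching, so brute force over permutations gives the exact value for very small sets. The function names `exact_w1` and `greedy_w1` are hypothetical.

```python
import itertools
import math


def exact_w1(agent, expert):
    """Exact 1-Wasserstein distance for equal-size, uniformly weighted sets.

    Brute force over all matchings, so only usable for a handful of points.
    """
    n = len(agent)
    assert n == len(expert)
    return min(
        sum(math.dist(a, expert[j]) for a, j in zip(agent, perm)) / n
        for perm in itertools.permutations(range(n))
    )


def greedy_w1(agent, expert):
    """Cost of the greedy matching: each agent point, in order, takes the
    nearest expert point that has not been used yet."""
    remaining = list(range(len(expert)))
    total = 0.0
    for a in agent:
        j = min(remaining, key=lambda k: math.dist(a, expert[k]))
        total += math.dist(a, expert[j])
        remaining.remove(j)
    return total / len(agent)


expert_points = [(0.0, 0.0), (1.0, 0.0)]
agent_points = [(0.6, 0.0), (1.0, 0.0)]
exact = exact_w1(agent_points, expert_points)
greedy = greedy_w1(agent_points, expert_points)
```

Here the first agent point greedily claims the expert point at (1, 0), which forces the second agent point onto a distant match, so `greedy` comes out larger than `exact`. The greedy value can never be smaller than the exact one, since it is the cost of one particular feasible coupling.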

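Key findings

PWIL reaches near-expert performance on several MuJoCo tasks, including Humanoid, even from a single demonstration.
Compared with DAC, PWIL is competitive or better at minimizing the Wasserstein distance to the expert across environments, and in most cases reduces that distance more markedly.
The offline-derived reward has a simple form with only two hyperparameters, which reduces the tuning burden.
Ablations show that matching on actions as well as states (dropped in the PWIL-state variant) and an appropriately weighted MDP metric are critical to performance, and that "pop-outs" are also important for recovering the full expert behavior.
PWIL extends to pixel-based observations by learning the distance in an embedding space (via Temporal Cycle-Consistency Learning), and still recovers task success in a door-opening scenario.
PWIL is strongly sample-efficient, obtaining meaningful Humanoid scores from only a few demonstrations, and is robust across random seeds and environments.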


This review was generated by AI and reviewed by a human editor.