QUICK REVIEW

[Paper Review] SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards

Siddharth Reddy, Anca D. Dragan, Sergey Levine | arXiv (Cornell University) | May 27, 2019

Reinforcement Learning in Robotics | References: 35 | Cited by: 53

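One-Sentence Summary

SQIL is a simple imitation learning method that uses constant rewards within off-policy reinforcement learning to achieve long-horizon imitation without learning a reward function; it outperforms behavioral cloning and is competitive with GAIL on a variety of tasks.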

ABSTRACT

Learning to imitate expert behavior from demonstrations can be challenging, especially in environments with high-dimensional, continuous observations and unknown dynamics. Supervised learning methods based on behavioral cloning (BC) suffer from distribution shift: because the agent greedily imitates demonstrated actions, it can drift away from demonstrated states due to error accumulation. Recent methods based on reinforcement learning (RL), such as inverse RL and generative adversarial imitation learning (GAIL), overcome this issue by training an RL agent to match the demonstrations over a long horizon. Since the true reward function for the task is unknown, these methods learn a reward function from the demonstrations, often using complex and brittle approximation techniques that involve adversarial training. We propose a simple alternative that still uses RL, but does not require learning a reward function. The key idea is to provide the agent with an incentive to match the demonstrations over a long horizon, by encouraging it to return to demonstrated states upon encountering new, out-of-distribution states. We accomplish this by giving the agent a constant reward of r=+1 for matching the demonstrated action in a demonstrated state, and a constant reward of r=0 for all other behavior. Our method, which we call soft Q imitation learning (SQIL), can be implemented with a handful of minor modifications to any standard Q-learning or off-policy actor-critic algorithm. Theoretically, we show that SQIL can be interpreted as a regularized variant of BC that uses a sparsity prior to encourage long-horizon imitation. Empirically, we show that SQIL outperforms BC and achieves competitive results compared to GAIL, on a variety of image-based and low-dimensional tasks in Box2D, Atari, and MuJoCo.

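Motivation and Objectives

Advance imitation learning in environments with high-dimensional observations and unknown dynamics, avoiding the distribution shift inherent in BC.
Provide a simple RL-based imitation method that does not require learning a reward function.
Show that constant rewards can drive long-horizon imitation by encouraging the agent to match demonstrated states and to return to them when it ends up out of distribution.
Make SQIL implementable with a handful of minor modifications to standard Q-learning or off-policy actor-critic algorithms.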

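Proposed Method

Initialize the replay buffer with expert demonstrations and assign the demonstration transitions a constant reward of r = +1.
Add new agent interaction data to the same replay buffer with a reward of r = 0.
Sample training batches as a 50/50 mix of demonstrations and new experience to keep the effective reward stable.
Optimize the soft Q-learning objective using the squared soft Bellman error on both demonstrations and new experience.
Show that this is equivalent to a regularized behavioral cloning objective that imposes a sparsity prior on the implicit rewards.
Extend SQIL to continuous actions by applying it on top of an off-policy actor-critic method such as SAC.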
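
The steps above can be summarized as a single objective. The block below is a sketch in standard soft Q-learning notation rather than a quotation from this review: the symbols \gamma (discount factor), \mathcal{D}_{\text{demo}} and \mathcal{D}_{\text{samp}} (demonstration and agent transitions), and \lambda_{\text{samp}} (relative weight of agent data) are assumed names, and setting \lambda_{\text{samp}} = 1 corresponds to the 50/50 batch mix described above.

```latex
% Squared soft Bellman error on a set of transitions D, with every reward fixed to the constant r
\delta^2(\mathcal{D}, r) = \frac{1}{|\mathcal{D}|} \sum_{(s, a, s') \in \mathcal{D}}
  \Bigl( Q_\theta(s, a) - \bigl( r + \gamma \log \sum_{a'} \exp Q_\theta(s', a') \bigr) \Bigr)^2

% SQIL-style loss: demonstrations are scored with r = 1, agent experience with r = 0
\mathcal{L}(\theta) = \delta^2(\mathcal{D}_{\text{demo}}, 1)
  + \lambda_{\text{samp}} \, \delta^2(\mathcal{D}_{\text{samp}}, 0)
```

The term \log \sum_{a'} \exp Q_\theta(s', a') is the soft value of the next state, so no reward model appears anywhere in the loss; the only supervision is which buffer a transition came from.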

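Experimental Results

Research Questions

RQ1 Can a pure RL method with constant rewards achieve long-horizon imitation without learning a reward function?
RQ2 Does SQIL mitigate the distribution shift inherent in BC without adversarial training?
RQ3 Is SQIL competitive with GAIL on image-based and low-dimensional tasks while remaining simple to implement?
RQ4 How does combining demonstration data with environment interaction affect how the policy evolves over time?
RQ5 Can SQIL be adapted to continuous control settings using off-policy algorithms?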

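Key Findings

SQIL outperforms behavioral cloning on all tested tasks, especially when the state distribution shifts.
SQIL achieves results competitive with GAIL across a range of image-based and low-dimensional environments.
SQIL can be implemented with mild modifications to standard off-policy RL algorithms and does not require learning a reward function.
SQIL sustains long-horizon imitation by favoring actions that keep the agent close to demonstrated states and by replaying demonstrations with a fixed reward.
For continuous control, SQIL instantiated with SAC shows strong performance and works with only a small number of demonstrations.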
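
The finding that SQIL needs only mild changes to an off-policy learner can be made concrete with a small tabular sketch. Everything here is illustrative and is not the authors' code: the function names (`soft_value`, `sample_sqil_batch`, `sqil_update`), the three-state toy problem, the learning rate, and the choice of a semi-gradient step with the target held fixed are all assumptions made for this example.

```python
import numpy as np


def soft_value(q_values):
    """Soft state value V(s) = log sum_a exp Q(s, a), computed stably."""
    m = np.max(q_values, axis=-1, keepdims=True)
    return np.squeeze(m, axis=-1) + np.log(np.sum(np.exp(q_values - m), axis=-1))


def sample_sqil_batch(demo, agent, batch_size, rng):
    """Draw half the batch from demonstrations (r = 1) and half from agent data (r = 0).

    `demo` and `agent` are integer arrays of shape (N, 3) holding (s, a, s_next).
    """
    half = batch_size // 2
    d = demo[rng.integers(len(demo), size=half)]
    g = agent[rng.integers(len(agent), size=half)]
    transitions = np.concatenate([d, g])
    rewards = np.concatenate([np.ones(half), np.zeros(half)])
    return transitions, rewards


def sqil_update(q, transitions, rewards, gamma=0.9, lr=0.5):
    """One semi-gradient step on the squared soft Bellman error (tabular Q, updated in place)."""
    s, a, s_next = transitions.T
    target = rewards + gamma * soft_value(q[s_next])  # target treated as a constant
    td_error = q[s, a] - target
    # Constant factor 2 from the squared error is absorbed into lr.
    np.subtract.at(q, (s, a), lr * td_error / len(rewards))
    return float(np.mean(td_error ** 2))


# Hypothetical toy problem: 3 states, 2 actions.
# The expert cycles between states 0 and 1 using action 1.
demo = np.array([[0, 1, 1], [1, 1, 0]])
# Agent data: action 0 leads to state 2, which the agent never leaves.
agent = np.array([[0, 0, 2], [1, 0, 2], [2, 0, 2], [2, 1, 2]])

rng = np.random.default_rng(0)
q = np.zeros((3, 2))
for _ in range(2000):
    batch, rewards = sample_sqil_batch(demo, agent, 8, rng)
    loss = sqil_update(q, batch, rewards)

# Soft policy pi(a | s) proportional to exp Q(s, a).
policy = np.exp(q - soft_value(q)[:, None])
```

In this toy setting the only SQIL-specific pieces are the constant reward labels and the balanced sampling in `sample_sqil_batch`; `sqil_update` is an ordinary soft Q-learning step. After training, the demonstrated action should carry the higher Q-value in the demonstrated states 0 and 1, because it is the only behavior that ever earns reward.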

This review was generated by AI and checked by a human editor.