QUICK REVIEW

[Paper Review] SQIL: Imitation Learning via Regularized Behavioral Cloning

Siddharth Reddy, Anca D. Dragan | arXiv (Cornell University) | May 27, 2019

Reinforcement Learning in Robotics | References: 13 | Cited by: 33

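One-sentence summary

This paper proposes soft Q imitation learning (SQIL), an imitation learning method that does not learn a reward function: it assigns a constant reward of +1 to demonstrated state-action pairs and 0 to all other behavior, which enables stable, long-horizon imitation and reduces drift away from demonstrated states. On image-based and low-dimensional control tasks, SQIL performs competitively with GAIL and outperforms behavioral cloning (BC).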

ABSTRACT

Learning to imitate expert behavior from demonstrations can be challenging, especially in environments with high-dimensional, continuous observations and unknown dynamics. Supervised learning methods based on behavioral cloning (BC) suffer from distribution shift: because the agent greedily imitates demonstrated actions, it can drift away from demonstrated states due to error accumulation. Recent methods based on reinforcement learning (RL), such as inverse RL and generative adversarial imitation learning (GAIL), overcome this issue by training an RL agent to match the demonstrations over a long horizon. Since the true reward function for the task is unknown, these methods learn a reward function from the demonstrations, often using complex and brittle approximation techniques that involve adversarial training. We propose a simple alternative that still uses RL, but does not require learning a reward function. The key idea is to provide the agent with an incentive to match the demonstrations over a long horizon, by encouraging it to return to demonstrated states upon encountering new, out-of-distribution states. We accomplish this by giving the agent a constant reward of r=+1 for matching the demonstrated action in a demonstrated state, and a constant reward of r=0 for all other behavior. Our method, which we call soft Q imitation learning (SQIL), can be implemented with a handful of minor modifications to any standard Q-learning or off-policy actor-critic algorithm. Theoretically, we show that SQIL can be interpreted as a regularized variant of BC that uses a sparsity prior to encourage long-horizon imitation. Empirically, we show that SQIL outperforms BC and achieves competitive results compared to GAIL, on a variety of image-based and low-dimensional tasks in Box2D, Atari, and MuJoCo.

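Research motivation and goals

Address distribution shift in behavioral cloning, where accumulated errors in high-dimensional, continuous observation spaces cause the policy to drift away from demonstrated states.
Remove the need to learn a complex reward function for imitation, avoiding brittle adversarial training and reward approximation.
Develop a method that achieves long-horizon imitation using only demonstration data and a standard off-policy reinforcement learning algorithm.
Provide a theoretically grounded, regularized variant of behavioral cloning that encourages state-action distribution matching.
Reach performance comparable to state-of-the-art methods such as GAIL while being simpler and more stable.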

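Proposed method

SQIL gives the agent a constant reward of r = +1 when it matches the demonstrated action in a demonstrated state, and r = 0 otherwise.
This reward signal plugs into a standard Q-learning or off-policy actor-critic framework with only a few minor modifications.
The sparse reward acts as a sparsity prior that pushes the policy back toward demonstrated states, reducing distribution shift.
By avoiding adversarial training and reward-function inference, SQIL simplifies training while maintaining performance.
The method can be interpreted as a regularized form of behavioral cloning in which the regularizer encourages long-horizon imitation.
The method applies to both discrete and continuous control tasks and can handle image-based observations.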
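
The constant-reward idea can be sketched in a few lines. The example below is a minimal tabular illustration, not the authors' implementation. It assumes a soft Q-learning backbone in which the state value is the log-sum-exp of the Q-values, and it assumes that "matching a demonstration" is realized by relabelling: transitions taken from the demonstrations get r = +1, and transitions collected by the agent get r = 0. The function names (soft_value, sqil_batch, soft_bellman_error), the discount gamma = 0.99, and the toy transitions are all hypothetical.

```python
import numpy as np

def soft_value(q_rows):
    """Soft state value V(s) = log sum_a exp(Q(s, a)), computed stably."""
    m = np.max(q_rows, axis=-1, keepdims=True)
    return (m + np.log(np.sum(np.exp(q_rows - m), axis=-1, keepdims=True))).squeeze(-1)

def sqil_batch(demo, agent):
    """Merge demonstration and agent transitions and relabel their rewards.

    Each transition is (state, action, next_state). Demonstrations get r = +1
    and the agent's own experience gets r = 0, so no reward function is learned.
    """
    trans = np.array(demo + agent, dtype=np.int64)
    rewards = np.concatenate([np.ones(len(demo)), np.zeros(len(agent))])
    return trans[:, 0], trans[:, 1], trans[:, 2], rewards

def soft_bellman_error(q, s, a, s_next, r, gamma=0.99):
    """Mean squared soft Bellman error for a tabular Q of shape [n_states, n_actions]."""
    target = r + gamma * soft_value(q[s_next])
    return float(np.mean((q[s, a] - target) ** 2))

# Hypothetical toy data: 3 states, 2 actions.
demo = [(0, 1, 1), (1, 1, 2)]    # expert transitions, relabelled with r = +1
agent = [(0, 0, 0), (1, 0, 0)]   # agent transitions, relabelled with r = 0
s, a, s_next, r = sqil_batch(demo, agent)

q = np.zeros((3, 2))             # untrained Q-table
loss = soft_bellman_error(q, s, a, s_next, r)
print(r, loss)
```

In a full agent this loss would be minimized over the Q-function's parameters with gradient steps while the agent keeps adding its own transitions to the r = 0 buffer. The sketch only shows how the two reward labels enter the Bellman target.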

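Experimental results

Research questions

RQ1: Can a simple imitation learning method that learns no reward function outperform standard behavioral cloning in high-dimensional, continuous control environments?
RQ2: Does a constant reward signal on demonstrated state-action pairs effectively reduce distribution shift and improve long-horizon performance?
RQ3: Can SQIL match GAIL's performance without adversarial reward learning or complex reward-function approximation?
RQ4: How does SQIL compare with BC and GAIL across diverse environments, including Box2D, Atari, and MuJoCo?
RQ5: Is SQIL robust, and does it generalize, across tasks with different observation modalities, including pixel inputs?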

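Main findings

SQIL outperforms standard behavioral cloning on all evaluated tasks, with less policy drift and higher sample efficiency.
SQIL performs comparably to GAIL on image-based and low-dimensional control tasks in Box2D, Atari, and MuJoCo.
In high-dimensional observation settings such as pixel inputs, SQIL consistently improves on BC.
SQIL requires no adversarial training or reward-function learning, so training is simpler and more stable.
The constant-reward scheme effectively encourages the agent to return to demonstrated states, mitigating distribution shift.
Empirically, SQIL generalizes well across diverse environments, covering both discrete and continuous control settings.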


This review was generated by AI and reviewed by a human editor.