QUICK REVIEW

[Paper Review] Imitation Learning as $f$-Divergence Minimization

Liyiming Ke, Sanjiban Choudhury|arXiv (Cornell University)|May 30, 2019

Anomaly Detection Techniques and Applications | Cited by 39

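One-Sentence Summary

This paper unifies imitation learning (IL) as minimizing an f-divergence between the learner's and the expert's trajectory distributions. It shows that reverse KL is mode-seeking under multi-modal demonstrations and can outperform the mode-covering KL and JS objectives on some tasks, and it recovers BC, GAIL, and DAgger as special cases.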

ABSTRACT

We address the problem of imitation learning with multi-modal demonstrations. Instead of attempting to learn all modes, we argue that in many tasks it is sufficient to imitate any one of them. We show that the state-of-the-art methods such as GAIL and behavior cloning, due to their choice of loss function, often incorrectly interpolate between such modes. Our key insight is to minimize the right divergence between the learner and the expert state-action distributions, namely the reverse KL divergence or I-projection. We propose a general imitation learning framework for estimating and minimizing any f-Divergence. By plugging in different divergences, we are able to recover existing algorithms such as Behavior Cloning (Kullback-Leibler), GAIL (Jensen Shannon) and Dagger (Total Variation). Empirical results show that our approximate I-projection technique is able to imitate multi-modal behaviors more reliably than GAIL and behavior cloning.

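Motivation and Objectives

The work is motivated by imitation learning from multi-modal demonstrations in which no single mode is preferred over the others.
Proposes a unified f-divergence minimization framework that covers existing IL methods.
Develops estimators that minimize f-divergences between the learner and expert distributions, at either the trajectory level or the state-action level.
Highlights the advantage of reverse KL (mode-seeking) for handling multi-modal demonstrations safely.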

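Proposed Method

Formalizes IL as minimizing D_f between the learner and expert trajectory distributions.
Proves that D_f between the learner's and expert's average state-action distributions lower-bounds D_f between their trajectory distributions (Theorem 3.1).
Introduces a variational lower bound that estimates D_f using a discriminator-like function (phi) and the convex conjugate (f*).
Proposes the f-VIM algorithm, which performs saddle-point optimization over the learner (policy) and the discriminator for the chosen f-divergence.
Shows that existing methods arise as special cases of particular divergences: KL recovers BC, JS recovers GAIL, and TV recovers DAgger. RKL-VIM (reverse KL) is the new I-projection variant.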
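
The variational estimate described above can be illustrated on a small discrete example. The sketch below is not the paper's implementation. It assumes the convention D_f(p, q) = sum_x q(x) f(p(x)/q(x)) with p the expert and q the learner, and it uses the standard bound D_f(p, q) >= E_p[phi(x)] - E_q[f*(phi(x))], which becomes tight at phi = f'(p/q). The names `DIVERGENCES`, `f_divergence`, `variational_bound`, and `optimal_phi`, and the two toy distributions, are hypothetical and chosen only for illustration.

```python
import numpy as np

# For each divergence: the generator f, its derivative f' (which gives the
# optimal discriminator phi = f'(p/q)), and the convex conjugate f*.
DIVERGENCES = {
    # KL(expert || learner): f(u) = u log u
    "kl": (lambda u: u * np.log(u),
           lambda u: 1.0 + np.log(u),
           lambda t: np.exp(t - 1.0)),
    # Reverse KL, KL(learner || expert): f(u) = -log u, conjugate defined for t < 0
    "reverse_kl": (lambda u: -np.log(u),
                   lambda u: -1.0 / u,
                   lambda t: -1.0 - np.log(-t)),
    # Jensen-Shannon (up to a factor of 2), conjugate defined for t < log 2
    "js": (lambda u: -(u + 1.0) * np.log((1.0 + u) / 2.0) + u * np.log(u),
           lambda u: np.log(2.0 * u / (1.0 + u)),
           lambda t: -np.log(2.0 - np.exp(t))),
    # Total variation: f(u) = |u - 1| / 2, conjugate f*(t) = t for |t| <= 1/2
    "tv": (lambda u: 0.5 * np.abs(u - 1.0),
           lambda u: 0.5 * np.sign(u - 1.0),
           lambda t: t),
}


def f_divergence(p, q, name):
    """Exact D_f(p, q) = sum_x q(x) f(p(x) / q(x)) for discrete p, q."""
    f, _, _ = DIVERGENCES[name]
    return float(np.sum(q * f(p / q)))


def variational_bound(p, q, phi, name):
    """Lower bound E_p[phi] - E_q[f*(phi)] for a given discriminator phi."""
    _, _, f_star = DIVERGENCES[name]
    return float(np.sum(p * phi) - np.sum(q * f_star(phi)))


def optimal_phi(p, q, name):
    """Discriminator that makes the bound tight: phi = f'(p / q)."""
    _, f_prime, _ = DIVERGENCES[name]
    return f_prime(p / q)


# Toy distributions over three outcomes (illustrative values only).
expert = np.array([0.45, 0.10, 0.45])
learner = np.array([0.20, 0.60, 0.20])

for name in DIVERGENCES:
    phi_star = optimal_phi(expert, learner, name)
    exact = f_divergence(expert, learner, name)
    tight = variational_bound(expert, learner, phi_star, name)
    # A shrunken (suboptimal) discriminator can only lower the estimate.
    loose = variational_bound(expert, learner, 0.9 * phi_star, name)
    print(f"{name}: exact={exact:.4f} tight={tight:.4f} loose={loose:.4f}")
```

In a full saddle-point method such as the one summarized above, phi would be a learned function updated to maximize this bound while the policy is updated to minimize it. Here the maximizer is available in closed form, so the example only checks that the bound matches the exact divergence at the optimum.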

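Experimental Results

Research Questions

RQ1: Does minimizing f-divergences between trajectory distributions yield robust imitation learning under multi-modal expert demonstrations?
RQ2: How do different f-divergences (KL, JS, TV, reverse KL) affect mode-covering and mode-collapsing behavior in IL?
RQ3: Can a unified variational framework recover and connect existing IL methods (BC, GAIL, DAgger) and offer practical advantages on multi-modal data?
RQ4: What practical estimation and evaluation issues arise when using reverse KL for IL in realistic, continuous domains?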

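Key Findings

Reverse KL (I-projection) is mode-seeking and tends to converge to a subset of the demonstrator's modes, which may improve safety and reliability on multi-modal tasks.
KL and JS behave in a mode-covering way and can interpolate between modes, which may lead to unsafe or undesirable behavior in some settings.
The f-VIM framework includes Behavior Cloning (KL), GAIL (JS), and DAgger (TV) as special cases obtained from different f-divergences.
On high-dimensional continuous-control tasks, RKL-VIM can reach a higher asymptotic return than JS-VIM/GAIL in some environments (e.g., MuJoCo tasks), and the two discriminators are observed to emphasize different regions.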
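
The mode-seeking versus mode-covering contrast in these findings can be reproduced on a toy density-fitting problem. The sketch below is not one of the paper's experiments. It assumes a bimodal one-dimensional "expert" with modes at -2 and +2 and a single-Gaussian "learner", and it fits the learner by brute-force grid search under forward KL (expert to learner) and reverse KL (learner to expert). All names and numeric choices are hypothetical.

```python
import numpy as np

# Discretized 1-D state space (illustrative choice).
x = np.linspace(-6.0, 6.0, 1201)


def gaussian(points, mean, std):
    """Gaussian density evaluated on the grid."""
    return np.exp(-0.5 * ((points - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def normalize(weights):
    """Turn non-negative grid values into a discrete distribution."""
    return weights / weights.sum()


# Bimodal expert: two equally likely modes at -2 and +2.
expert = normalize(0.5 * gaussian(x, -2.0, 0.5) + 0.5 * gaussian(x, 2.0, 0.5))


def kl(p, q):
    """Discrete KL(p || q)."""
    return float(np.sum(p * np.log(p / q)))


def fit(objective):
    """Grid search for the single-Gaussian learner minimizing `objective`."""
    best = None
    for mean in np.linspace(-3.0, 3.0, 61):
        for std in np.linspace(0.3, 3.0, 28):
            loss = objective(normalize(gaussian(x, mean, std)))
            if best is None or loss < best[0]:
                best = (loss, float(mean), float(std))
    return best[1], best[2]


# Forward KL(expert || learner): mode-covering (M-projection).
forward_mean, forward_std = fit(lambda learner: kl(expert, learner))
# Reverse KL(learner || expert): mode-seeking (I-projection).
reverse_mean, reverse_std = fit(lambda learner: kl(learner, expert))

print("forward KL fit:", forward_mean, forward_std)
print("reverse KL fit:", reverse_mean, reverse_std)
```

Under these assumptions the forward-KL fit places a wide Gaussian between the two modes, where the expert has almost no mass, while the reverse-KL fit places a narrow Gaussian on one mode. Which of the two modes it picks is arbitrary because the problem is symmetric.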


This review was generated by AI and reviewed by a human editor.