QUICK REVIEW

[Paper Review] Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction

Wen Sun, Arun Venkatraman | arXiv (Cornell University) | Mar 2, 2017

Reinforcement Learning in Robotics | References: 28 | Cited by: 91

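One-Sentence Summary

AggreVaTeD is a differentiable imitation learning method that uses a cost-to-go oracle to learn policies for sequential prediction and high-dimensional control. It often learns faster than reinforcement learning and reaches better performance, even when the oracle is sub-optimal.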

ABSTRACT

Researchers have demonstrated state-of-the-art performance in sequential decision making problems (e.g., robotics control, sequential prediction) with deep neural network models. One often has access to near-optimal oracles that achieve good performance on the task during training. We demonstrate that AggreVaTeD --- a policy gradient extension of the Imitation Learning (IL) approach of (Ross & Bagnell, 2014) --- can leverage such an oracle to achieve faster and better solutions with less training data than a less-informed Reinforcement Learning (RL) technique. Using both feedforward and recurrent neural network predictors, we present stochastic gradient procedures on a sequential prediction task, dependency-parsing from raw image data, as well as on various high dimensional robotics control problems. We also provide a comprehensive theoretical study of IL that demonstrates we can expect up to exponentially lower sample complexity for learning with AggreVaTeD than with RL algorithms, which backs our empirical findings. Our results and theory indicate that the proposed approach can achieve superior performance with respect to the oracle when the demonstrator is sub-optimal.

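Motivation and Objectives

Improve sample efficiency and performance on sequential decision-making problems by exploiting a near-optimal cost-to-go oracle during training.
Extend imitation learning to complex, high-dimensional models (e.g., deep networks, LSTMs) for sequential prediction tasks.
Provide online gradient and natural gradient updates that scale to large function approximators.
Analyze IL versus RL theoretically to show the potential for exponential or polynomial gains in sample efficiency when Q* is available.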

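Proposed Method

Formulate IL as a no-regret online learning problem whose loss is the expert's cost-to-go Q*, evaluated under the state distribution induced by the current policy (see Eq. 1).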
Derive two families of gradient updates: regular Online Gradient Descent (OGD) and Exponential Gradient (EG), the latter of which leads to a natural gradient method.
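Give practical gradient expressions for discrete and continuous actions (Eq. 3, Eq. 4, Eq. 5; Eq. 6 and Eq. 7 for EG).
Introduce differentiable AggreVaTeD (Alg. 1), which trains expressive policies (e.g., neural networks, LSTMs) using a decaying mixture of expert and learner roll-ins.
Describe an efficient Fisher-information-based natural gradient update that computes the descent direction via a low-rank representation and conjugate gradient.
Provide variance-reduced gradient estimates (Eq. 12, Eq. 13) and sample-based approximations of the gradient and the Fisher matrix (Eq. 14).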
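
As a rough illustration of the discrete-action case, the sketch below differentiates the expected oracle cost-to-go under a softmax policy and takes plain gradient steps on it. It is not the paper's implementation: the linear-softmax policy, the toy 0/1 oracle costs, the one-step setting, and every name here (expected_cost_and_grad, roll_in_action, ogd_step, demo) are assumptions made for illustration only. The natural gradient variant and the variance-reduced estimators are omitted.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def expected_cost_and_grad(theta, s, q_star):
    """Expected oracle cost-to-go under pi_theta(.|s), and its gradient.

    theta:  (num_actions, state_dim) weights of a linear-softmax policy
    s:      (state_dim,) state features
    q_star: (num_actions,) oracle cost-to-go Q*(s, a) for every action
    """
    pi = softmax(theta @ s)
    cost = pi @ q_star
    # d cost / d logit_b = pi_b * (Q*_b - cost); logits = theta @ s
    grad = np.outer(pi * (q_star - cost), s)
    return cost, grad

def roll_in_action(rng, alpha, expert_action, theta, s):
    """Sample a roll-in action from the mixture alpha*expert + (1-alpha)*learner."""
    if rng.random() < alpha:
        return expert_action
    return rng.choice(len(theta), p=softmax(theta @ s))

def ogd_step(theta, states, q_values, lr):
    """One plain gradient step on the average expected cost-to-go."""
    grads = [expected_cost_and_grad(theta, s, q)[1]
             for s, q in zip(states, q_values)]
    return theta - lr * np.mean(grads, axis=0)

def demo(num_iters=50, lr=0.1, seed=0):
    """Toy one-step problem: the oracle assigns cost 0 to one action, 1 to the rest."""
    rng = np.random.default_rng(seed)
    num_actions, state_dim, batch = 3, 4, 64
    w_oracle = rng.normal(size=(num_actions, state_dim))  # hypothetical oracle
    states = rng.normal(size=(batch, state_dim))
    q_values = np.ones((batch, num_actions))
    q_values[np.arange(batch), (states @ w_oracle.T).argmax(axis=1)] = 0.0

    def avg_cost(theta):
        return np.mean([expected_cost_and_grad(theta, s, q)[0]
                        for s, q in zip(states, q_values)])

    theta = np.zeros((num_actions, state_dim))
    before = avg_cost(theta)
    for _ in range(num_iters):
        theta = ogd_step(theta, states, q_values, lr)
    return before, avg_cost(theta)
```

Calling demo() returns the average expected cost before and after training. In a sequential task, the states fed to ogd_step would instead come from roll-ins that choose actions with roll_in_action, using a mixture weight alpha that shrinks over iterations. The exact decay schedule is left open here because this summary does not specify it.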

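Experimental Results

Research Questions

RQ1: Can differentiable imitation learning with an oracle outperform conventional RL on sequential prediction and control?
RQ2: How much sample efficiency is gained by using the expert's cost-to-go Q* in online learning updates?
RQ3: Can AggreVaTeD scale to deep architectures and partially observable settings (e.g., with LSTMs) while retaining its performance gains?
RQ4: In terms of regret bounds and sample complexity for discrete MDPs, what are the theoretical limits of IL relative to RL?
RQ5: On high-dimensional tasks, how do the different update schemes (regular gradient vs. natural gradient) differ in practice?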

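Key Findings

With the differentiable formulation and access to an oracle, AggreVaTeD reaches expert-level performance and, when the oracle is sub-optimal, can even surpass the expert (based on empirical results).
In robotics simulations, natural-gradient AggreVaTeD outperforms the expert by 5.8% on Acrobot and by 25% on Cart-pole.
AggreVaTeD with LSTM policies remains effective in partially observable environments, whereas RL fails to improve.
On continuous-action tasks (Walker, Hopper), AggreVaTeD improves on the expert by 5.4% on Walker and reaches 97% of expert performance on Hopper.
The dependency parsing experiments show that AggreVaTeD with LSTM and NN policies is competitive with RL and supervised learning baselines on the UAS metric.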

This review was generated by AI and reviewed by human editors.