QUICK REVIEW

[Paper Review] Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction

Wen Sun, Arun Venkatraman | arXiv (Cornell University) | Mar 3, 2017

Reinforcement Learning in Robotics | References: 35 | Cited by: 84

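One-sentence summary

AggreVaTeD extends AggreVaTe into a differentiable imitation learning method for sequential prediction. It uses online gradient and natural gradient updates and leverages an oracle to learn faster and with less data, which makes it suitable for deep models on sequential tasks.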

ABSTRACT

Researchers have demonstrated state-of-the-art performance in sequential decision making problems (e.g., robotics control, sequential prediction) with deep neural network models. One often has access to near-optimal oracles that achieve good performance on the task during training. We demonstrate that AggreVaTeD --- a policy gradient extension of the Imitation Learning (IL) approach of (Ross & Bagnell, 2014) --- can leverage such an oracle to achieve faster and better solutions with less training data than a less-informed Reinforcement Learning (RL) technique. Using both feedforward and recurrent neural network predictors, we present stochastic gradient procedures on a sequential prediction task, dependency-parsing from raw image data, as well as on various high dimensional robotics control problems. We also provide a comprehensive theoretical study of IL that demonstrates we can expect up to exponentially lower sample complexity for learning with AggreVaTeD than with RL algorithms, which backs our empirical findings. Our results and theory indicate that the proposed approach can achieve superior performance with respect to the oracle when the demonstrator is sub-optimal.

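Motivation and objectives

Leverage an oracle that provides the cost-to-go to accelerate learning on sequential tasks.
Develop a differentiable IL method compatible with deep neural policies (including LSTMs).
Provide gradient-based training procedures (regular and natural gradient) with theoretical support.
Demonstrate empirical performance gains on robotics control and sequential prediction tasks.
Provide sample-efficiency guarantees for IL and compare them against RL in discrete MDPs.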

Proposed method

Formulate IL as a differentiable policy gradient problem, building on the online no-regret learning reduction of Ross & Bagnell (2014).
Derive the per-iteration loss $\ell_n(\pi) = \frac{1}{H} \sum_{t=1}^{H} \mathbb{E}_{s_t \sim d_t^{\pi_n}} \mathbb{E}_{a \sim \pi(\cdot \mid s_t)} \left[ Q_t^*(s_t, a) \right]$, where $Q_t^*$ is the oracle's cost-to-go.
Compute gradients exactly for discrete actions via Eq. (3), and use importance weighting for continuous actions via Eq. (4).
Provide online gradient descent (OGD) and exponential gradient (EG) updates, along with natural gradient updates based on the Fisher information matrix (Eq. 8, 9).
Introduce AggreVaTeD, which rolls in with a mixture of the expert and learner policies using a decaying mixing rate $\alpha_n$ (Algorithm 1).
Implement variance-reduced gradient estimators (Eq. 12, 13) and an efficient natural gradient step based on conjugate gradient (CG).
Approximate the Fisher matrix as $S_n S_n^T$ so that the natural gradient step can be solved scalably with CG.
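The loss, its gradient for discrete actions, and the CG-based natural gradient step listed above can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a linear softmax policy, a batch of already-sampled states, and a table of oracle cost-to-go values. The function names, the damping term, and the fixed step size are all hypothetical choices made for illustration. The Fisher-vector product here takes the exact expectation over actions at the sampled states, rather than the sampled factorization $S_n S_n^T$ described in the paper.

```python
import numpy as np

def policy_probs(theta, states):
    """Linear softmax policy (assumed form): theta is (d, A), states is (N, d)."""
    z = states @ theta
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss(theta, states, q_star):
    """Monte Carlo estimate of the loss: mean over states of sum_a pi(a|s) Q*(s, a)."""
    return np.mean(np.sum(policy_probs(theta, states) * q_star, axis=1))

def loss_grad(theta, states, q_star):
    """Exact gradient over discrete actions at the sampled states."""
    p = policy_probs(theta, states)
    expected_q = np.sum(p * q_star, axis=1, keepdims=True)
    # d/d(logit_k) of sum_a p_a Q_a equals p_k * (Q_k - E_a[Q_a])
    g_logits = p * (q_star - expected_q)
    return states.T @ g_logits / states.shape[0]

def fisher_vector_product(theta, states, v, damping=1e-3):
    """Product of the softmax Fisher matrix with v; damping is an added assumption."""
    p = policy_probs(theta, states)
    u = states @ v
    u_centered = u - np.sum(p * u, axis=1, keepdims=True)
    return states.T @ (p * u_centered) / states.shape[0] + damping * v

def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    """Solve matvec(x) = b for a symmetric positive definite operator."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = np.sum(r * r)
    for _ in range(iters):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / np.sum(p * Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = np.sum(r * r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def natural_gradient_step(theta, states, q_star, step=0.1):
    """One descent step along F^{-1} g, with F^{-1} g obtained by CG."""
    g = loss_grad(theta, states, q_star)
    direction = conjugate_gradient(
        lambda v: fisher_vector_product(theta, states, v), g)
    return theta - step * direction

# Usage on synthetic data (random states and oracle values, for illustration only)
rng = np.random.default_rng(0)
demo_states = rng.normal(size=(64, 5))
demo_q_star = rng.uniform(size=(64, 3))
demo_theta = 0.1 * rng.normal(size=(5, 3))
demo_theta_next = natural_gradient_step(demo_theta, demo_states, demo_q_star)
```

Solving for the direction with CG means the Fisher matrix is never formed or inverted explicitly. Only matrix-vector products are needed, which is the property that lets this kind of update scale to larger policies.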

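Experimental results

Research questions

RQ1: Can differentiable imitation learning with an oracle learn faster and more data-efficiently than standard RL on sequential prediction tasks?
RQ2: With deep models, how do the online gradient and natural gradient variants of AggreVaTeD compare in sample efficiency and final performance?
RQ3: Is handling partially observable inputs through an LSTM effective?
RQ4: In discrete MDPs, what theoretical guarantees exist for the efficiency of IL relative to RL, including when the Q* estimates are noisy?
RQ5: How does AggreVaTeD perform against RL and prior IL methods on high-dimensional robotics control and sequential prediction benchmarks?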

Key findings

Method	UAS / reward	Variance	Notes
AggreVaTeD (LSTM)	0.924 ± 0.10	–	Dependency parsing (Handwritten Algebra)
AggreVaTeD (NN)	0.851 ± 0.10	–	Dependency parsing (Handwritten Algebra)
SL-RL (LSTM)	0.826 ± 0.09	–	Supervised-like RL baseline
SL-RL (NN)	0.386 ± 0.10	–	Supervised-like RL baseline
RL (LSTM)	0.256 ± 0.07	–	Reinforcement learning baseline
RL (NN)	0.227 ± 0.06	–	Reinforcement learning baseline
DAGGER	0.832 ± 0.02	–	Imitation learning baseline
SL (LSTM)	0.813 ± 0.10	–	Supervised learning baseline
SL (NN)	0.325 ± 0.20	–	Supervised learning baseline
Random	~0.150	–	Random policy

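When the oracle is suboptimal, AggreVaTeD can match or exceed expert-level performance, and it outperforms non-interactive IL methods.
Natural gradient updates generally yield faster and more robust improvement than regular gradient updates on most tasks.
On the robotics tasks (CartPole, Acrobot, Walker, Hopper), AggreVaTeD significantly outperforms the RL baseline in both learning speed and final reward.
In the partially observable dependency parsing setting (LSTM policy), AggreVaTeD reaches 92% of expert performance, while RL struggles.
In the dependency parsing experiments, AggreVaTeD with LSTMs reaches 0.924 UAS (±0.10) and outperforms multiple baselines, showing the strong gains from differentiable IL.
The theoretical results show that, on a constructed MDP, IL can have exponentially lower sample complexity than RL, with polynomial guarantees in general discrete MDPs.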

This review was generated by AI and reviewed by a human editor.