QUICK REVIEW

[Paper Review] Model-Augmented Actor-Critic: Backpropagating through Paths

Ignasi Clavera, Violet Fu|arXiv (Cornell University)|May 16, 2020

Reinforcement Learning in Robotics | References: 34 | Cited by: 40

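One-Sentence Summary

MAAC backpropagates gradients through a differentiable learned model over several future steps and uses a terminal value function to stabilize long-horizon training. Compared with state-of-the-art model-based and model-free reinforcement learning methods, it achieves higher data efficiency and competitive asymptotic performance.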

ABSTRACT

Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator to augment the data for policy optimization or value function learning. In this paper, we show how to make more effective use of the model by exploiting its differentiability. We construct a policy optimization algorithm that uses the pathwise derivative of the learned model and policy across future timesteps. Instabilities of learning across many timesteps are prevented by using a terminal value function, learning the policy in an actor-critic fashion. Furthermore, we present a derivation on the monotonic improvement of our objective in terms of the gradient error in the model and value function. We show that our approach (i) is consistently more sample efficient than existing state-of-the-art model-based algorithms, (ii) matches the asymptotic performance of model-free algorithms, and (iii) scales to long horizons, a regime where typically past model-based approaches have struggled.

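Motivation and Objectives

Motivate and develop a model-based policy optimization method that exploits the differentiability of the learned dynamics.
Reduce sample complexity while matching the asymptotic performance of model-free methods.
Use a terminal value function within an actor-critic framework to stabilize learning over long horizons.
Provide theoretical guarantees that relate the gradient error to the approximation errors of the model and the value function.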

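Proposed Method

Propose a model-augmented actor-critic objective that backpropagates through H steps of the learned model: J_pi(theta) = E[ sum_{t=0}^{H-1} gamma^t r(s_t) + gamma^H Q_hat(s_H, a_H) ], where states are rolled out under the learned model and actions are sampled from the policy pi_theta.
Use the pathwise derivative (reparameterization) to compute gradients through the differentiable model and policy.
Use a terminal Q-function to prevent gradient instability, and treat the horizon H as a hyperparameter that balances model-based and model-free signals.
Train a bootstrap ensemble of dynamics models by maximum likelihood to capture both epistemic and aleatoric uncertainty.
Learn two Q-functions to stabilize value estimates, and use a STEVE-style target for value learning.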
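
The objective and its pathwise gradient can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a hypothetical scalar system with a linear "learned" model, a linear Gaussian policy, a quadratic reward, and a quadratic terminal critic, and every constant (A, B, GAMMA, Q_S, Q_A, SIGMA) is made up for illustration. Because the action is reparameterized as a = theta*s + SIGMA*eps, the H-step return is a differentiable function of theta once the noise is fixed, so the derivative can be carried along the rollout.

```python
import random

# Hypothetical scalar setup; all constants are illustrative assumptions.
A, B = 0.9, 0.5        # linear "learned" model: s' = A*s + B*a
GAMMA = 0.99           # discount factor
Q_S, Q_A = 2.0, 0.1    # terminal critic: Q_hat(s, a) = -(Q_S*s^2 + Q_A*a^2)
SIGMA = 0.1            # fixed policy noise scale


def objective_and_grad(theta, s0, noise):
    """Return (J, dJ/dtheta) for one rollout with horizon H = len(noise) - 1.

    The policy is reparameterized as a = theta*s + SIGMA*eps. With the noise
    held fixed, derivatives are propagated forward through policy and model.
    """
    horizon = len(noise) - 1
    s, ds = s0, 0.0            # state and d(state)/d(theta)
    J, dJ = 0.0, 0.0
    for t in range(horizon):
        # Discounted reward r(s_t) = -s_t^2 and its derivative.
        J += GAMMA**t * (-s * s)
        dJ += GAMMA**t * (-2.0 * s * ds)
        # Reparameterized action and its derivative w.r.t. theta.
        a = theta * s + SIGMA * noise[t]
        da = s + theta * ds
        # Differentiable model step (right-hand side uses the old s, ds).
        s, ds = A * s + B * a, A * ds + B * da
    # Terminal value gamma^H * Q_hat(s_H, a_H) closes the truncated rollout.
    a = theta * s + SIGMA * noise[horizon]
    da = s + theta * ds
    J += GAMMA**horizon * (-(Q_S * s * s + Q_A * a * a))
    dJ += GAMMA**horizon * (-(2.0 * Q_S * s * ds + 2.0 * Q_A * a * da))
    return J, dJ


def estimate(theta, s0=1.0, horizon=5, n_samples=256, seed=0):
    """Monte Carlo estimate of the objective and its pathwise gradient."""
    rng = random.Random(seed)
    total_J = total_g = 0.0
    for _ in range(n_samples):
        noise = [rng.gauss(0.0, 1.0) for _ in range(horizon + 1)]
        J, g = objective_and_grad(theta, s0, noise)
        total_J += J
        total_g += g
    return total_J / n_samples, total_g / n_samples


if __name__ == "__main__":
    J, grad = estimate(theta=-0.5)
    print("objective:", J, "pathwise gradient:", grad)
```

With horizon=0 the sum of rewards is empty and only the critic term remains, which is the model-free end of the trade-off that H controls. In practice the model, policy, and critic would be neural networks and the derivative would come from automatic differentiation rather than the hand-written chain rule used here.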

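Experimental Results

Research Questions

RQ1: Does MAAC outperform state-of-the-art model-based and model-free baselines in sample efficiency and asymptotic performance?
RQ2: How does the gradient error of MAAC depend on the derivative errors of the model and the Q-function, and on the horizon H?
RQ3: Is backpropagating through the model necessary for performance, and how does planning at test time (MPC) affect the results?
RQ4: Does MAAC provide a monotonic improvement guarantee when model and function approximation errors are taken into account?
RQ5: How do the model ensemble and the STEVE-style target affect training stability and performance?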

Key Findings

Environment	MAAC+MPC	MAAC
AntEnv	3.97e3 ± 1.48e3	3.06e3 ± 1.45e3
HalfCheetahEnv	1.09e4 ± 9.45e1	1.07e4 ± 2.53e2
HopperEnv	2.8e3 ± 1.1e1	2.77e3 ± 3.31e0
Walker2dEnv	1.76e3 ± 7.8e1	1.61e3 ± 4.04e2

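Compared with MBPO, STEVE, SVG(1), and SAC on four MuJoCo benchmarks, MAAC achieves better sample efficiency and competitive asymptotic performance.
The gradient error behaves as the theoretical bound predicts: a shorter horizon reduces the error contributed by the model derivatives, while a longer horizon amplifies it.
Ablations show that backpropagating through the model (a nonzero horizon H) is essential for strong performance; the STEVE-style target helps stability but has a smaller effect.
A test-time MPC refinement step yields additional gains on the harder tasks, with smaller gains in the easier environments.
With a model ensemble and a terminal value function, MAAC effectively mitigates model bias and supports longer planning horizons without sacrificing stability.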
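
The size of the MPC effect can be read off the table above with a small calculation. The sketch below copies the mean values from the table and computes the relative gain of MAAC+MPC over MAAC; the helper name relative_gain is introduced here for illustration only. It ignores the reported ± spreads, which are large for some environments (for example ± 1.45e3 for MAAC on AntEnv), so the percentages are indicative rather than statistically tested.

```python
# Mean values copied from the table above: (MAAC+MPC, MAAC).
results = {
    "AntEnv": (3.97e3, 3.06e3),
    "HalfCheetahEnv": (1.09e4, 1.07e4),
    "HopperEnv": (2.8e3, 2.77e3),
    "Walker2dEnv": (1.76e3, 1.61e3),
}


def relative_gain(with_mpc, without_mpc):
    """Fractional improvement of the MPC variant over plain MAAC."""
    return (with_mpc - without_mpc) / without_mpc


gains = {env: relative_gain(*means) for env, means in results.items()}

# Print environments from largest to smallest relative gain.
for env, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{env}: {100 * gain:.1f}%")
# AntEnv: 29.7%
# Walker2dEnv: 9.3%
# HalfCheetahEnv: 1.9%
# HopperEnv: 1.1%
```

On these means, the gain is largest on AntEnv and Walker2dEnv and close to one or two percent on HalfCheetahEnv and HopperEnv.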

This review was generated by AI and reviewed by human editors.