QUICK REVIEW

[Paper Review] Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation

Yuhuai Wu, Elman Mansimov|arXiv (Cornell University)|Aug 17, 2017

Reinforcement Learning in Robotics | References: 22 | Cited by: 470

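ONE-SENTENCE SUMMARY

ACKTR applies Kronecker-factored approximate curvature and trust-region natural gradient updates to actor-critic methods, achieving a 2- to 3-fold improvement in sample efficiency on Atari and MuJoCo and learning directly from raw pixel inputs.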

ABSTRACT

In this work, we propose to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature. We extend the framework of natural policy gradient and propose to optimize both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region; hence we call our method Actor Critic using Kronecker-Factored Trust Region (ACKTR). To the best of our knowledge, this is the first scalable trust region natural gradient method for actor-critic methods. It is also a method that learns non-trivial tasks in continuous control as well as discrete control policies directly from raw pixel inputs. We tested our approach across discrete domains in Atari games as well as continuous domains in the MuJoCo environment. With the proposed methods, we are able to achieve higher rewards and a 2- to 3-fold improvement in sample efficiency on average, compared to previous state-of-the-art on-policy actor-critic methods. Code is available at https://github.com/openai/baselines

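MOTIVATION AND OBJECTIVES

Improve sample efficiency in deep reinforcement learning beyond what standard SGD updates achieve.
Develop a scalable natural gradient method suitable for large actor-critic models.
Extend Kronecker-factored curvature to optimize the actor and the critic jointly.
Enable learning directly from raw pixel inputs in both discrete and continuous control tasks.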

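PROPOSED METHOD

Use Kronecker-factored approximate curvature (K-FAC) to efficiently invert an approximation of the Fisher matrix for natural gradient updates.
Apply natural gradient updates with a trust-region constraint to both the actor and the critic (using Gauss-Newton for the critic).
Build a joint actor-critic architecture with optionally shared layers, with the actor and critic outputs treated as independent where needed.
Adopt factored Tikhonov damping and asynchronous computation of the statistics and inverses to reduce computation.
Use a trust-region formulation to choose the step size so that the KL divergence of each update is bounded.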
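The steps above can be sketched for a single linear layer s = W a. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `kfac_trust_region_step`, its arguments, and the default values of `damping`, `delta`, and `eta_max` are all hypothetical. The damping split uses the trace-based factor π as I understand it from the K-FAC literature, and the step size follows the trust-region rule η = min(η_max, sqrt(2δ / ΔθᵀF̂Δθ)).

```python
import numpy as np


def kfac_trust_region_step(a, g, grad_W, damping=1e-2, delta=1e-3, eta_max=0.25):
    """Sketch of one K-FAC natural-gradient step for a linear layer s = W a.

    a       : (N, d_in)  layer inputs for a batch
    g       : (N, d_out) back-propagated gradients w.r.t. s, sampled from the model
    grad_W  : (d_out, d_in) gradient of the training objective w.r.t. W
    Returns (step, eta, quad), where step = eta * natural gradient and
    quad is the quadratic form (nat. grad)^T F_hat (nat. grad).
    All hyperparameter defaults are illustrative assumptions.
    """
    n, d_in = a.shape
    d_out = g.shape[1]

    # Kronecker factors of the Fisher approximation F ~ A (x) S.
    A = a.T @ a / n  # second moment of inputs, (d_in, d_in)
    S = g.T @ g / n  # second moment of output gradients, (d_out, d_out)

    # Factored Tikhonov damping: split the damping between the two factors.
    pi = np.sqrt((np.trace(A) / d_in) / (np.trace(S) / d_out))
    A_d = A + pi * np.sqrt(damping) * np.eye(d_in)
    S_d = S + (np.sqrt(damping) / pi) * np.eye(d_out)

    # Natural gradient S_d^{-1} grad_W A_d^{-1}, using two small solves
    # instead of inverting the full (d_in*d_out) x (d_in*d_out) matrix.
    right = np.linalg.solve(A_d, grad_W.T).T  # grad_W @ inv(A_d), A_d symmetric
    nat_grad = np.linalg.solve(S_d, right)

    # Quadratic form under the damped factors: since S_d @ nat_grad @ A_d
    # equals grad_W, it reduces to an elementwise inner product.
    quad = float(np.sum(nat_grad * grad_W))

    # Trust-region step size: keeps 0.5 * eta^2 * quad <= delta.
    eta = min(eta_max, np.sqrt(2.0 * delta / quad))
    return eta * nat_grad, eta, quad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(64, 5))
    g = rng.normal(size=(64, 3))
    grad_W = rng.normal(size=(3, 5))
    step, eta, quad = kfac_trust_region_step(a, g, grad_W)
    print("step size:", eta, "approx. KL:", 0.5 * eta**2 * quad)
```

The point of the factorization is that only a d_in × d_in and a d_out × d_out system are solved per layer. In a real training loop the factors would be running averages refreshed only occasionally, which is where the asynchronous computation mentioned above would fit.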

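EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How does ACKTR compare with state-of-the-art on-policy methods and second-order baselines in sample efficiency and computational efficiency?
RQ2: How does applying natural gradient updates to both the actor and the critic affect stability and performance?
RQ3: How does ACKTR scale with batch size and input modality (including pixel inputs) in discrete and continuous control?
RQ4: Which norm and damping strategy for critic optimization best stabilizes training and improves sample efficiency?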

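KEY FINDINGS

ACKTR significantly improves sample efficiency and final performance over A2C and TRPO on the Atari and MuJoCo benchmarks.
Applying natural gradient updates to both the actor and the critic yields scalable performance gains that previous methods could not achieve.
Using a Gauss-Newton-based norm for the critic, rather than Euclidean-norm updates, markedly improves sample efficiency and training stability.
ACKTR's computational cost is close to that of SGD-based methods, with only a slightly higher cost per update.
In continuous control tasks, ACKTR performs strongly when learning directly from raw pixel observations, with competitive results.
Larger batch sizes benefit ACKTR more than they benefit first-order methods, suggesting substantial speed-up potential in distributed settings.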
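The Gauss-Newton finding can be tied back to the natural-gradient view with a short derivation. This is a sketch following the ACKTR paper's description of the critic as I understand it: the critic output is treated as the mean of a Gaussian whose noise scale σ is an assumed constant, and θ denotes the network parameters.

```latex
% Critic as a Gaussian observation model (sigma is an assumed constant).
p(v \mid s_t) = \mathcal{N}\bigl(v;\, V_\theta(s_t),\, \sigma^2\bigr)

% Its Fisher matrix is a scaled Gauss-Newton matrix of the squared-error loss:
F_{\text{critic}}
  = \mathbb{E}\Bigl[\nabla_\theta \log p(v \mid s_t)\,
                    \nabla_\theta \log p(v \mid s_t)^{\top}\Bigr]
  = \frac{1}{\sigma^2}\,
    \mathbb{E}\Bigl[\nabla_\theta V_\theta(s_t)\,
                    \nabla_\theta V_\theta(s_t)^{\top}\Bigr]

% With a shared network, the two outputs are assumed independent:
p(a, v \mid s) = \pi(a \mid s)\, p(v \mid s),
\qquad
F = \mathbb{E}\Bigl[\nabla_\theta \log p(a, v \mid s)\,
                    \nabla_\theta \log p(a, v \mid s)^{\top}\Bigr]
```

Setting σ = 1 recovers the plain Gauss-Newton matrix, so the same K-FAC machinery used for the actor can be applied to the critic.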

This review was generated by AI and reviewed by human editors.