QUICK REVIEW

[Paper Review] Relative Entropy Regularized Policy Iteration

Abbas Abdolmaleki, Jost Tobias Springenberg|arXiv (Cornell University)|Dec 5, 2018

Reinforcement Learning in Robotics | References: 35 | Cited by: 45

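One-Sentence Summary

An off-policy actor-critic reinforcement learning method that alternates between Q-value estimation, KL-regularized local non-parametric policy improvement, and parametric policy fitting with decoupled Gaussian updates, achieving strong results on multiple continuous control benchmarks.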

ABSTRACT

We present an off-policy actor-critic algorithm for Reinforcement Learning (RL) that combines ideas from gradient-free optimization via stochastic search with learned action-value function. The result is a simple procedure consisting of three steps: i) policy evaluation by estimating a parametric action-value function; ii) policy improvement via the estimation of a local non-parametric policy; and iii) generalization by fitting a parametric policy. Each step can be implemented in different ways, giving rise to several algorithm variants. Our algorithm draws on connections to existing literature on black-box optimization and 'RL as an inference' and it can be seen either as an extension of the Maximum a Posteriori Policy Optimisation algorithm (MPO) [Abdolmaleki et al., 2018a], or as an extension of Trust Region Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) [Abdolmaleki et al., 2017b; Hansen et al., 1997] to a policy iteration scheme. Our comparison on 31 continuous control tasks from parkour suite [Heess et al., 2017], DeepMind control suite [Tassa et al., 2018] and OpenAI Gym [Brockman et al., 2016] with diverse properties, limited amount of compute and a single set of hyperparameters, demonstrate the effectiveness of our method and the state of art results. Videos, summarizing results, can be found at goo.gl/HtvJKR .

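Research Motivation and Goals

Develop a data-efficient off-policy actor-critic framework for continuous control.
Combine Q-function estimation with a local non-parametric policy improvement step.
Introduce a KL-regularized parametric policy fitting step to keep learning stable.
Decouple the mean and covariance updates of the Gaussian policy to prevent premature convergence.
Demonstrate robustness to a single set of hyperparameters across diverse benchmarks.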

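Proposed Method

Policy evaluation: learn a parametric Q-function with TD learning and a target network.
Policy improvement: build a local non-parametric action distribution by reweighting sampled actions according to their Q-values.
Project that distribution back onto the parametric policy through KL-regularized weighted maximum likelihood (with softmax-based weights).
Optionally transform the weights with an exponential transformation or a rank-based scheme; solve for the temperature parameter through a convex dual problem.
Fit the improved Gaussian policy with decoupled mean and covariance updates to prevent premature convergence.
Control the size of each policy update with KL constraints on both the mean and the covariance; optimize by coordinate ascent.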
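The reweighting step and the temperature dual described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not spell out the dual, so the form used here, g(eta) = eta * epsilon + eta * mean_s log mean_a exp(Q(s, a) / eta), is assumed from the standard KL-constrained objective. The function names, the bound `epsilon`, and the golden-section search are all hypothetical choices made for this example.

```python
import math


def logmeanexp(xs):
    """Numerically stable log(mean(exp(x))) for a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))


def dual(eta, q_values, epsilon):
    """Assumed temperature dual g(eta).

    q_values[s][i] is Q(s, a_i) for the i-th action sampled at state s.
    """
    per_state = [logmeanexp([q / eta for q in qs]) for qs in q_values]
    return eta * epsilon + eta * sum(per_state) / len(per_state)


def solve_temperature(q_values, epsilon, lo=1e-3, hi=1e3, iters=100):
    """Minimise the dual over eta with golden-section search on log(eta).

    The dual is convex in eta, hence unimodal, so a bracketing search works.
    The search interval [lo, hi] is an arbitrary choice for this sketch.
    """
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if dual(math.exp(c), q_values, epsilon) < dual(math.exp(d), q_values, epsilon):
            b = d
        else:
            a = c
    return math.exp((a + b) / 2.0)


def improvement_weights(qs, eta):
    """Softmax weights over the actions sampled at one state."""
    m = max(qs)
    unnorm = [math.exp((q - m) / eta) for q in qs]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def kl_to_uniform(weights):
    """KL between the reweighted samples and the uniform sample weights."""
    n = len(weights)
    return sum(w * math.log(w * n) for w in weights if w > 0.0)


if __name__ == "__main__":
    # Two states with four sampled actions each (made-up Q-values).
    q = [[1.0, 2.0, 0.5, 3.0], [0.0, -1.0, 2.0, 0.5]]
    eta = solve_temperature(q, epsilon=0.1)
    for qs in q:
        print(improvement_weights(qs, eta))
```

At the minimiser of this dual, the average KL between the reweighted samples and the original samples equals `epsilon` whenever the constraint is active, which is what makes the temperature a trust-region control rather than a hand-tuned constant.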

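Experimental Results

Research Questions

RQ1: In the off-policy actor-critic setting, how does KL-regularized policy improvement affect stability and performance?
RQ2: Do decoupled mean/covariance updates for the Gaussian policy improve learning stability and prevent premature convergence?
RQ3: How does the framework perform with a single set of hyperparameters across diverse continuous control tasks (Control Suite, Parkour, OpenAI Gym)?
RQ4: How do different Q-function estimation strategies (e.g., TD0 vs. Retrace) affect final performance on complex tasks?
RQ5: How does the method compare with baselines such as DDPG, SVG, and SAC on high-dimensional tasks?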

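Key Findings

Using a single set of hyperparameters, the method performs strongly on 31 continuous control tasks spanning several benchmark suites.
Decoupled mean/covariance updates for the Gaussian policy help avoid premature convergence and improve stability and performance.
KL constraints on both the mean and the covariance matter for reliable learning across tasks; without them, learning can become unstable.
Retrace-based policy evaluation speeds up learning on the challenging Parkour tasks compared with TD0.
On OpenAI Gym tasks, the method reaches higher asymptotic performance than SAC with comparable sample efficiency.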
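To make the separate mean and covariance constraints concrete, here is a small sketch that splits the Gaussian KL into two parts and checks each against its own bound. It assumes diagonal covariances, and it assumes one common reading of the decoupling: the mean term is measured under the old covariance, and the covariance term is measured with the mean held at its old value. The function names and the bounds `eps_mu` and `eps_sigma` are hypothetical.

```python
import math


def kl_mean_part(mu_old, var_old, mu_new):
    """KL(N(mu_old, var_old) || N(mu_new, var_old)) for diagonal Gaussians."""
    return 0.5 * sum(
        (m - mo) ** 2 / vo for mo, vo, m in zip(mu_old, var_old, mu_new)
    )


def kl_cov_part(var_old, var_new):
    """KL(N(mu_old, var_old) || N(mu_old, var_new)) for diagonal Gaussians."""
    return 0.5 * sum(
        vo / vn - 1.0 + math.log(vn / vo) for vo, vn in zip(var_old, var_new)
    )


def within_trust_region(mu_old, var_old, mu_new, var_new, eps_mu, eps_sigma):
    """Check the two bounds independently (illustrative values only)."""
    return (
        kl_mean_part(mu_old, var_old, mu_new) <= eps_mu
        and kl_cov_part(var_old, var_new) <= eps_sigma
    )


if __name__ == "__main__":
    mu_old, var_old = [0.0, 0.0], [1.0, 1.0]
    mu_new, var_new = [0.1, -0.1], [0.99, 1.0]
    print(kl_mean_part(mu_old, var_old, mu_new))
    print(kl_cov_part(var_old, var_new))
    print(within_trust_region(mu_old, var_old, mu_new, var_new, 0.1, 1e-3))
```

Keeping two bounds lets the covariance bound be set much tighter than the mean bound, so the policy can shift its mean while its exploration noise shrinks only slowly, which is consistent with the finding that the decoupling helps avoid premature convergence.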

This review was generated by AI and checked by a human editor.