QUICK REVIEW

[Paper Review] Combining policy gradient and Q-learning

Brendan O’Donoghue, Rémi Munos|arXiv (Cornell University)|Nov 5, 2016

Reinforcement Learning in Robotics | References: 28 | Cited by: 94

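One-Sentence Summary

This paper proposes PGQL, a reinforcement learning algorithm that combines policy gradient with off-policy Q-learning by exploiting the relationship between the fixed points of regularized policy gradient and the Q-values. PGQL estimates Q-values from the policy's action preferences and refines them with off-policy Q-learning updates, which improves data efficiency and stability. On the full Atari suite it outperforms both A3C and deep Q-learning, with a median human-normalized score above 100% under random-start evaluation.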

ABSTRACT

Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and not able to take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. This is motivated by making a connection between the fixed points of the regularized policy gradient algorithm and the Q-values. This connection allows us to estimate the Q-values from the action preferences of the policy, to which we apply Q-learning updates. We refer to the new technique as 'PGQL', for policy gradient and Q-learning. We also establish an equivalency between action-value fitting techniques and actor-critic algorithms, showing that regularized policy gradient techniques can be interpreted as advantage function learning algorithms. We conclude with some numerical examples that demonstrate improved data efficiency and stability of PGQL. In particular, we tested PGQL on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage actor-critic (A3C) and Q-learning.

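Motivation and Objectives

Address the poor data efficiency and the on-policy restriction of vanilla policy gradient methods in deep reinforcement learning.
Enable off-policy learning within the policy gradient framework by connecting regularized policy gradient to the Q-values.
Improve sample efficiency and training stability by integrating Q-learning updates into policy gradient optimization.
Show, through a decomposition of the Q-values, that regularized policy gradient methods can be interpreted as advantage function learning algorithms.
Empirically compare PGQL with state-of-the-art methods such as A3C and deep Q-learning in the Arcade Learning Environment (Atari).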

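Proposed Method

The method derives Q-value estimates from the policy's action preferences at the fixed point of the regularized policy gradient update.
Off-policy Q-learning updates are applied to these estimated Q-values, using a replay buffer of past experience.
The algorithm combines two updates: a policy gradient update that improves the policy and a Q-learning update that refines the Q-values.
The Q-values are parameterized with a two-stream network architecture that decomposes them into a state-value function and an advantage function.
The policy gradient and Q-learning updates are balanced through hyperparameter scheduling of their learning rates, with Q-learning updates applied more frequently.
The method is implemented with deep neural networks, using a single network shared between the policy and the Q-values, and is applied to Atari environments.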
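The first three points above can be illustrated with a small, framework-free sketch. It assumes the entropy-regularized form Q(s, a) = alpha * (log pi(a|s) + H(pi_s)) + V(s) for the Q-value estimate, which is not spelled out in this review. The function names, the values of alpha and gamma, and the sample transition are all hypothetical, and no network or gradient step is shown.

```python
import math


def log_softmax(prefs):
    """Log-probabilities of a softmax policy over action preferences."""
    m = max(prefs)
    log_z = m + math.log(sum(math.exp(x - m) for x in prefs))
    return [x - log_z for x in prefs]


def q_from_policy(prefs, value, alpha):
    """Assumed estimate: Q(s, a) = alpha * (log pi(a|s) + H(pi_s)) + V(s)."""
    log_pi = log_softmax(prefs)
    entropy = -sum(math.exp(lp) * lp for lp in log_pi)
    return [alpha * (lp + entropy) + value for lp in log_pi]


def q_learning_td_error(q_s, action, reward, q_next, gamma, done):
    """One-step Q-learning error: r + gamma * max_b Q(s', b) - Q(s, a)."""
    bootstrap = 0.0 if done else gamma * max(q_next)
    return reward + bootstrap - q_s[action]


# Hypothetical transition (s, a, r, s') as it might be drawn from a replay buffer.
alpha, gamma = 0.1, 0.99
q_s = q_from_policy(prefs=[1.0, 0.5, -0.2], value=2.0, alpha=alpha)
q_next = q_from_policy(prefs=[0.3, 0.1, 0.9], value=2.5, alpha=alpha)
td = q_learning_td_error(q_s, action=0, reward=1.0, q_next=q_next,
                         gamma=gamma, done=False)
```

In this form the advantage term alpha * (log pi + H) averages to zero under the policy, so the policy-weighted mean of the estimated Q-values equals V(s). In a full implementation the TD error would drive a gradient step on the shared policy and value parameters, alongside the usual policy gradient step.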

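Experimental Results

Research Questions

RQ1: Can the fixed points of the regularized policy gradient algorithm be used to estimate Q-values that are consistent with the policy's action preferences?
RQ2: Does combining off-policy Q-learning updates with policy gradient optimization improve data efficiency and training stability?
RQ3: Can regularized policy gradient methods be interpreted, through a decomposition of the Q-values, as advantage function learning algorithms?
RQ4: How does PGQL compare with A3C and deep Q-learning on the Atari suite in terms of performance and sample efficiency?
RQ5: What are the failure modes of PGQL, and can they be attributed to local optima or to overfitting to early data?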

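Key Findings

PGQL performs best on the full Atari suite, outperforming both A3C and deep Q-learning in 34 of 57 games.
Under random-start evaluation, PGQL reaches a mean human-normalized score of 877.2% and a median of 145.6%.
Under human-start evaluation, PGQL reaches a mean of 416.7% and a median of 103.3%, above the human-performance threshold of 100%.
PGQL is more data efficient than A3C and Q-learning, and sample training curves show the advantage most clearly on its best-performing games.
Where PGQL underperforms, it often saturates early or collapses, which suggests overfitting to early data or convergence to a local optimum.
The method shows better stability and sample efficiency overall: PGQL ranked last in only one game, and in most of the games where it was not the best it placed between the other two methods.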
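The mean and median figures above are human-normalized scores. As a rough illustration, the sketch below assumes the commonly used normalization 100 * (agent - random) / (human - random), which this review does not state explicitly. The game names and raw scores are made up for the example and are not results from the paper.

```python
from statistics import mean, median


def human_normalized(agent, random_score, human):
    """Score in percent, where 0% is a random policy and 100% is the human baseline."""
    return 100.0 * (agent - random_score) / (human - random_score)


# Hypothetical raw scores per game: (agent, random baseline, human baseline).
games = {
    "game_a": (1200.0, 100.0, 600.0),
    "game_b": (45.0, 5.0, 85.0),
    "game_c": (300.0, 0.0, 250.0),
}

scores = {name: human_normalized(*raw) for name, raw in games.items()}
mean_score = mean(list(scores.values()))
median_score = median(list(scores.values()))
```

A few games with very large normalized scores can pull the mean far above the median, which is why both statistics are reported.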


This review was generated by AI and reviewed by a human editor.