QUICK REVIEW

[Paper Review] Addressing Function Approximation Error in Actor-Critic Methods

Scott Fujimoto, Herke van Hoof | arXiv (Cornell University) | Feb 26, 2018

Reinforcement Learning in Robotics | References: 39 | Cited by: 2,362

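One-Sentence Summary

The paper identifies overestimation bias in actor-critic methods and introduces TD3, a set of techniques (Clipped Double Q-learning, delayed policy updates, and target policy smoothing) that reduce bias and variance, achieving strong performance on OpenAI Gym continuous control tasks.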

ABSTRACT

In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic. Our algorithm builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation. We draw the connection between target networks and overestimation bias, and suggest delaying policy updates to reduce per-update error and further improve performance. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.

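Research Motivation and Objectives

Show that overestimation bias and high variance arise in actor-critic methods and that they harm learning.
Adapt and extend Double Q-learning to the actor-critic framework to reduce bias.
Develop mechanisms (target networks, delayed policy updates, and target policy smoothing) to reduce variance and improve stability.
Validate the approach empirically on seven OpenAI Gym continuous control tasks and compare it against baselines.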

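Proposed Method

Introduce Clipped Double Q-learning: compute the learning target from the minimum of two independent critics.
Maintain two independent critics, each with its own target network. The paper starts from a Double Q-learning variant with two actors and two critics, but the final TD3 algorithm uses a single actor (with its own target network) trained against the first critic to reduce computational cost.
Delay policy updates relative to critic updates so that the value estimates can settle before the policy is optimized.
Apply target policy smoothing by adding clipped noise to the target action, which reduces the variance of the target.
Maintain slowly updated target networks to stabilize learning and reduce per-update error.
Evaluate on MuJoCo continuous control tasks and compare against DDPG, PPO, TRPO, ACKTR, and SAC.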
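
The target computation described in the list above can be sketched as follows. This is a minimal pure-Python illustration for a single transition with a scalar action, not the authors' implementation. The function name `td3_target`, the toy actor and critics, and the default values of `gamma`, `noise_std`, `noise_clip`, and `max_action` are assumptions chosen for illustration.

```python
import random


def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(high, x))


def td3_target(reward, done, next_state, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0,
               rng=random):
    """Return a TD3-style learning target for one transition (scalar action).

    All default hyperparameters are illustrative assumptions.
    """
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(target_actor(next_state) + noise, -max_action, max_action)
    # Clipped Double Q-learning: use the smaller of the two target critics.
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    # One-step bootstrapped target; terminal transitions (done = 1) do not bootstrap.
    return reward + gamma * (1.0 - done) * q_min


# Hypothetical stand-ins for the target networks.
def toy_actor(state):
    return 0.5 * state


def toy_q1(state, action):
    return state + action


def toy_q2(state, action):
    return state + action + 1.0  # deliberately more optimistic than toy_q1


y = td3_target(reward=1.0, done=0.0, next_state=0.4,
               target_actor=toy_actor, target_q1=toy_q1, target_q2=toy_q2,
               rng=random.Random(0))
print(f"target value: {y:.3f}")
```

Because `toy_q2` is always more optimistic than `toy_q1`, the minimum always selects `toy_q1`, so the optimistic critic never inflates the target. Both critics would then be regressed toward this single shared target.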

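Experimental Results

Research Questions

RQ1: Do overestimation bias and high-variance temporal-difference errors arise in actor-critic methods with function approximation?
RQ2: Does clipping the Q-value estimate via Clipped Double Q-learning reduce overestimation bias in the actor-critic setting?
RQ3: Do target networks, delayed policy updates, and target policy smoothing improve stability and performance on continuous control tasks?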
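
The intuition behind RQ1 and RQ2 can be illustrated with a small simulation. It is a toy setup, not an experiment from the paper: every true Q-value is assumed to be 0, two hypothetical critics return independent zero-mean noisy estimates, and the function name `simulate_bias` and all parameter values are illustrative assumptions.

```python
import random


def simulate_bias(num_actions=10, noise_std=1.0, trials=20000, seed=0):
    """Estimate the bias of two target rules when every true Q-value is 0."""
    rng = random.Random(seed)
    single_sum = 0.0
    clipped_sum = 0.0
    for _ in range(trials):
        # Two independent, unbiased but noisy estimates of the same true values.
        q1 = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        q2 = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # Greedy action according to the first critic.
        greedy = max(range(num_actions), key=lambda a: q1[a])
        # Rule 1: trust the critic that also selected the action.
        single_sum += q1[greedy]
        # Rule 2: take the minimum of both critics at that action.
        clipped_sum += min(q1[greedy], q2[greedy])
    return single_sum / trials, clipped_sum / trials


single_bias, clipped_bias = simulate_bias()
print(f"greedy value from one critic: {single_bias:+.3f}")
print(f"minimum over two critics:     {clipped_bias:+.3f}")
```

Since the true value is 0 everywhere, each printed average is the bias of that rule. The first rule is biased upward because it selects and evaluates with the same noisy estimate. The second cannot exceed zero in expectation, so in this toy setting the minimum replaces overestimation with a mild underestimation.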

Key Findings

Environment	TD3	DDPG	Our DDPG	PPO	TRPO	ACKTR	SAC
HalfCheetah	9636.95 ± 859.065	3305.60	8577.29	1795.43	-15.57	1450.46	2347.19
Hopper	3564.07 ± 114.74	2020.46	1860.02	2164.70	2471.30	2428.39	2996.66
Walker2d	4682.82 ± 539.64	1843.85	3098.11	3317.69	2321.47	1216.70	1283.67
Ant	4372.44 ± 1000.33	1005.30	888.77	1083.20	-75.85	1821.94	655.35
Reacher	-3.60 ± 0.56	-6.51	-4.01	-6.18	-111.43	-4.26	-4.44
InvPendulum	1000.00 ± 0.00	1000.00	1000.00	1000.00	985.40	1000.00	1000.00
InvDoublePendulum	9337.47 ± 14.96	9355.52	8369.95	8977.94	205.85	9081.92	8487.15
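
To read the table at a glance, the short script below lists which method has the highest score in each environment. The numbers are copied from the table above, with TD3's ± spread omitted; the helper name `best_methods` is an illustrative assumption, and ties are reported as ties.

```python
METHODS = ["TD3", "DDPG", "Our DDPG", "PPO", "TRPO", "ACKTR", "SAC"]

# Scores copied from the table above, in the same column order as METHODS.
RESULTS = {
    "HalfCheetah": [9636.95, 3305.60, 8577.29, 1795.43, -15.57, 1450.46, 2347.19],
    "Hopper": [3564.07, 2020.46, 1860.02, 2164.70, 2471.30, 2428.39, 2996.66],
    "Walker2d": [4682.82, 1843.85, 3098.11, 3317.69, 2321.47, 1216.70, 1283.67],
    "Ant": [4372.44, 1005.30, 888.77, 1083.20, -75.85, 1821.94, 655.35],
    "Reacher": [-3.60, -6.51, -4.01, -6.18, -111.43, -4.26, -4.44],
    "InvPendulum": [1000.00, 1000.00, 1000.00, 1000.00, 985.40, 1000.00, 1000.00],
    "InvDoublePendulum": [9337.47, 9355.52, 8369.95, 8977.94, 205.85, 9081.92, 8487.15],
}


def best_methods(scores):
    """Return every method whose score equals the row maximum."""
    top = max(scores)
    return [method for method, score in zip(METHODS, scores) if score == top]


for env, scores in RESULTS.items():
    print(f"{env:18s} best: {', '.join(best_methods(scores))}")
```

This comparison ignores the spread across trials, so small gaps such as the one on InvDoublePendulum should not be over-interpreted.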

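Overestimation bias is present in actor-critic methods and can degrade learning.
Clipped Double Q-learning substantially reduces overestimation in the actor-critic target compared with variants based on standard Double DQN.
Delaying policy updates and using slowly updated target networks reduce per-update error and improve learning stability.
Target policy smoothing lowers the variance of the target, giving more robust value estimates.
On the seven MuJoCo tasks, TD3 matches or exceeds the state-of-the-art baselines in final performance and learning speed; in the table above, the only baseline with a higher score is DDPG on InvDoublePendulum (9355.52 vs. 9337.47 ± 14.96).
The ablation study indicates that combining Clipped Double Q-learning (CDQ), delayed updates, and target policy smoothing (TPS) gives the best performance.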
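
The update schedule behind the delayed-update and slow-target findings can be sketched as below. This is a simplified illustration under stated assumptions rather than the paper's training loop: parameters are plain lists of floats, no gradients are computed, and the names `soft_update` and `run_schedule` as well as the values `tau=0.005` and `policy_delay=2` are illustrative.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * p
            for t, p in zip(target_params, online_params)]


def run_schedule(total_steps, policy_delay=2):
    """Count how often each component is updated under a delayed schedule."""
    counts = {"critic": 0, "actor": 0, "target": 0}
    for step in range(1, total_steps + 1):
        counts["critic"] += 1          # critics are updated on every step
        if step % policy_delay == 0:   # actor and targets only every few steps
            counts["actor"] += 1
            counts["target"] += 1
    return counts


print(run_schedule(total_steps=10))

# A target parameter tracking a fixed online value drifts toward it slowly.
target, online = [0.0], [1.0]
for _ in range(100):
    target = soft_update(target, online)
print(f"target after 100 soft updates: {target[0]:.3f}")
```

With a small `tau`, the target parameters lag well behind the online ones, which is what keeps the bootstrapped target from shifting abruptly between updates.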

This review was generated by AI and reviewed by human editors.