QUICK REVIEW

[Paper Review] Addressing Function Approximation Error in Actor-Critic Methods

Scott Fujimoto, Herke van Hoof | arXiv (Cornell University) | Feb 26, 2018
Reinforcement Learning in Robotics | References: 39 | Citations: 2,362
One-sentence summary

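This paper identifies overestimation bias in actor-critic methods and introduces TD3, a set of techniques for reducing bias and variance (Clipped Double Q-learning, delayed policy updates, and target policy smoothing) that achieves strong performance on OpenAI Gym continuous control tasks.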

ABSTRACT

In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic. Our algorithm builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation. We draw the connection between target networks and overestimation bias, and suggest delaying policy updates to reduce per-update error and further improve performance. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.

Motivation and objectives

  • Demonstrate that overestimation bias and high variance occur in actor-critic methods and harm learning.
  • Adapt and extend Double Q-learning to actor-critic frameworks to reduce bias.
  • Develop mechanisms (target networks, delayed policy updates, and policy smoothing) to reduce variance and improve stability.
  • Empirically validate TD3 on seven OpenAI Gym continuous control tasks and compare to baselines.
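
The overestimation effect named in the first objective can be illustrated with a small simulation. This is only a sketch of the underlying statistical effect in the discrete-action case (taking a maximum over noisy but unbiased value estimates), not a reproduction of the paper's actor-critic experiments. The function name `greedy_value_bias` and all parameter values are hypothetical choices made for illustration.

```python
import random


def greedy_value_bias(num_actions=10, noise_std=1.0, trials=20000, seed=0):
    """Estimate E[max_a Q_hat(a)] - max_a Q(a) when every true Q(a) equals 0.

    Each estimate Q_hat(a) is the true value plus zero-mean Gaussian noise,
    so every individual estimate is unbiased. The maximum over them is not.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Noisy, individually unbiased estimates of a true value of 0.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # Acting greedily picks whichever estimate happens to be largest.
        total += max(estimates)
    # The true maximum is 0, so the average of the maxima is the bias itself.
    return total / trials


if __name__ == "__main__":
    print("bias with 10 actions:", greedy_value_bias(num_actions=10))
    print("bias with 1 action:  ", greedy_value_bias(num_actions=1))
```

With a single action there is nothing to maximize over and the average stays near zero; with several actions the average of the maxima is clearly positive even though no individual estimate is biased.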

Proposed method

  • Introduce Clipped Double Q-learning by taking the minimum of two independent critics for target calculation.
  • Train two separate critics, each with its own target network, while a single actor (also with a target network) is updated against the first critic only.
  • Delay policy updates relative to critic updates to allow value estimates to converge before policy optimization.
  • Apply target policy smoothing by adding clipped noise to the target action to reduce target variance.
  • Maintain target networks with slow updates to stabilize learning and reduce per-update error.
  • Evaluate on MuJoCo continuous control tasks and compare to DDPG, PPO, TRPO, ACKTR, and SAC.
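
The target calculation described in the list above (target policy smoothing followed by Clipped Double Q-learning) can be sketched for a single transition with a scalar action. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `td3_target`, the callables `target_actor`, `target_q1`, and `target_q2`, and the default hyperparameter values are all placeholders chosen for this example.

```python
import random


def clip(value, low, high):
    """Clamp a scalar to the closed interval [low, high]."""
    return max(low, min(high, value))


def td3_target(reward, done, next_state, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0,
               rng=random):
    """Compute one TD target y = r + gamma * (1 - done) * min(Q1', Q2').

    target_actor(state) -> action and target_q*(state, action) -> value are
    hypothetical stand-ins for the target networks. The defaults are
    illustrative assumptions.
    """
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(target_actor(next_state) + noise, -max_action, max_action)

    # Clipped Double Q-learning: bootstrap from the smaller of the two critics.
    q_next = min(target_q1(next_state, next_action),
                 target_q2(next_state, next_action))

    # No bootstrapping past a terminal transition (done == 1.0).
    return reward + gamma * (1.0 - done) * q_next
```

Both critics are regressed toward the same target `y`. Taking the minimum means that an overestimate in one critic is only propagated when the other critic overestimates too.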

Experimental results

Research questions

  • RQ1: Do overestimation bias and high-variance TD errors occur in actor-critic methods with function approximation?
  • RQ2: Can Clipped Double Q-learning reduce overestimation bias in actor-critic settings?
  • RQ3: Do target networks, delayed policy updates, and target policy smoothing improve stability and performance in continuous control tasks?

Key findings

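| Environment | TD3 | DDPG | Our DDPG | PPO | TRPO | ACKTR | SAC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HalfCheetah | 9636.95 ± 859.065 | 3305.60 | 8577.29 | 1795.43 | -15.57 | 1450.46 | 2347.19 |
| Hopper | 3564.07 ± 114.74 | 2020.46 | 1860.02 | 2164.70 | 2471.30 | 2428.39 | 2996.66 |
| Walker2d | 4682.82 ± 539.64 | 1843.85 | 3098.11 | 3317.69 | 2321.47 | 1216.70 | 1283.67 |
| Ant | 4372.44 ± 1000.33 | 1005.30 | 888.77 | 1083.20 | -75.85 | 1821.94 | 655.35 |
| Reacher | -3.60 ± 0.56 | -6.51 | -4.01 | -6.18 | -111.43 | -4.26 | -4.44 |
| InvPendulum | 1000.00 ± 0.00 | 1000.00 | 1000.00 | 1000.00 | 985.40 | 1000.00 | 1000.00 |
| InvDoublePendulum | 9337.47 ± 14.96 | 9355.52 | 8369.95 | 8977.94 | 205.85 | 9081.92 | 8487.15 |
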
  • Overestimation bias is present in actor-critic methods and can degrade learning quality.
  • Clipped Double Q-learning substantially reduces overestimation in actor-critic targets compared to standard Double DQN variants.
  • Delaying policy updates and using slow target networks decrease per-update error and improve learning stability.
  • Target policy smoothing reduces target variance by encouraging similar actions to receive similar value estimates, which makes the critic less sensitive to narrow peaks in the Q-function.
  • TD3 matches or outperforms state-of-the-art baselines across seven MuJoCo tasks in final performance and learning speed.
  • Ablation studies show that combining Clipped Double Q-learning, delayed policy updates, and target policy smoothing yields the best performance.
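
The delayed-update and slow-target findings above can be sketched as an update schedule plus a soft (Polyak) target update. This is an illustrative outline only: the names `soft_update` and `train_schedule`, the list-of-floats parameter representation, and the default values of `tau` and `policy_delay` are assumptions made for this example rather than details taken from this review.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a small step toward its online counterpart.

    Implements target <- tau * online + (1 - tau) * target, element by element.
    Parameters are plain lists of floats here for simplicity.
    """
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_params, online_params)]


def train_schedule(total_steps, policy_delay=2):
    """List the updates performed at each training step.

    The critics are updated at every step. The actor and all target networks
    are updated only once every `policy_delay` steps, so the value estimate
    has time to settle before the policy is changed.
    """
    events = []
    for step in range(1, total_steps + 1):
        events.append(("critic", step))
        if step % policy_delay == 0:
            events.append(("actor+targets", step))
    return events


if __name__ == "__main__":
    for kind, step in train_schedule(4):
        print(step, kind)
    print(soft_update([0.0, 1.0], [1.0, 1.0]))
```

A small `tau` keeps the targets changing slowly, and a larger `policy_delay` trades slower policy improvement for policy updates that rely on a better-fitted critic.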

This review was generated by AI and checked by a human editor.