QUICK REVIEW

[Paper Review] The Mirage of Action-Dependent Baselines in Reinforcement Learning

George Tucker, Surya Bhupatiraju|arXiv (Cornell University)|Feb 27, 2018

Reinforcement Learning in Robotics | References: 36 | Citations: 45

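One-Sentence Summary

This paper decomposes the variance of the policy gradient estimator, shows that learned state-action-dependent baselines do not significantly reduce variance relative to state-dependent baselines on common benchmarks, exposes bias introduced by implementation choices, and proposes a horizon-aware value function as a practical improvement.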

ABSTRACT

Policy gradient methods are a widely used class of model-free reinforcement learning algorithms where a state-dependent baseline is used to reduce gradient estimator variance. Several recent papers extend the baseline to depend on both the state and action and suggest that this significantly reduces variance and improves sample efficiency without introducing bias into the gradient estimates. To better understand this development, we decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in commonly tested benchmark domains. We confirm this unexpected result by reviewing the open-source code accompanying these prior papers, and show that subtle implementation decisions cause deviations from the methods presented in the papers and explain the source of the previously observed empirical gains. Furthermore, the variance decomposition highlights areas for improvement, which we demonstrate by illustrating a simple change to the typical value function parameterization that can significantly improve performance.

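Motivation and Objectives

Promote an accurate understanding of the variance reduction obtained from state-action-dependent baselines in policy gradient methods.
Decompose the variance of the policy gradient estimator and identify where variance reduction can realistically occur.
Measure the variance components on synthetic and benchmark tasks to assess the practical benefit of state-action-dependent baselines.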

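Proposed Method

Provide a variance decomposition of the policy gradient estimator with a state-action-dependent baseline (Eq. 2 and Eq. 3).
Analyze the variance terms Sigma_tau, Sigma_a, and Sigma_s, and identify the conditions under which Sigma_a is influential.
Empirically measure the variance components on LQG and continuous control tasks using oracle and learned baselines.
Review open-source implementations and identify the implementation details that induce bias.
Propose a horizon-aware value function parameterization that better fits finite-horizon tasks.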
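To make the three variance terms concrete, here is a minimal sketch on a one-step toy problem with discrete states and actions and a scalar policy parameter. It assumes the decomposition follows the law of total variance, conditioning in turn on return noise, action, and state; the notation in the paper's Eq. 2 and Eq. 3 may differ. All names (`variance_components`, `total_variance`, `feat`, `noise_std`) and the toy setup are hypothetical and not taken from the paper's code.

```python
import numpy as np


def variance_components(p_s, pi, score, q, phi, noise_std):
    """Exact (Sigma_tau, Sigma_a, Sigma_s) for a one-step toy problem.

    p_s: (S,) state probabilities; pi: (S, A) policy pi(a|s);
    score: (S, A) scalar score d/dtheta log pi(a|s);
    q: (S, A) expected return; phi: (S, A) baseline values;
    noise_std: std of the zero-mean return noise around q (an assumption).
    """
    w = p_s[:, None] * pi  # joint probability of (s, a)
    # Sigma_tau: variance from return noise given (s, a).
    sigma_tau = np.sum(w * noise_std**2 * score**2)
    # Sigma_a: variance over actions of the baseline-corrected term.
    resid = (q - phi) * score
    mean_a = np.sum(pi * resid, axis=1, keepdims=True)
    sigma_a = np.sum(w * (resid - mean_a) ** 2)
    # Sigma_s: variance over states of the per-state expected gradient.
    g_s = np.sum(pi * q * score, axis=1)
    sigma_s = np.sum(p_s * (g_s - np.sum(p_s * g_s)) ** 2)
    return sigma_tau, sigma_a, sigma_s


def total_variance(p_s, pi, score, q, phi, noise_std):
    """Variance of g = (R - phi) * score + E_a[phi * score], computed directly."""
    w = p_s[:, None] * pi
    correction = np.sum(pi * phi * score, axis=1, keepdims=True)
    mean_sa = (q - phi) * score + correction  # E[g | s, a]
    second_moment = np.sum(w * (mean_sa**2 + noise_std**2 * score**2))
    return second_moment - np.sum(w * mean_sa) ** 2


# Hypothetical toy problem: softmax policy with one scalar parameter theta.
rng = np.random.default_rng(0)
S, A, theta, noise_std = 4, 3, 0.5, 1.0
p_s = np.full(S, 1.0 / S)
feat = rng.normal(size=(S, A))
logits = theta * feat
pi = np.exp(logits - logits.max(axis=1, keepdims=True))
pi /= pi.sum(axis=1, keepdims=True)
score = feat - np.sum(pi * feat, axis=1, keepdims=True)
q = rng.normal(size=(S, A))
v = np.sum(pi * q, axis=1, keepdims=True)  # state value V(s)

baselines = {
    "none": np.zeros((S, A)),
    "state V(s)": np.broadcast_to(v, (S, A)),
    "oracle Q(s,a)": q,
}
results = {}
for name, phi in baselines.items():
    results[name] = variance_components(p_s, pi, score, q, phi, noise_std)
    print(name, results[name])
```

In this sketch only Sigma_a depends on the baseline, and the oracle baseline phi(s, a) = Q(s, a) drives it to zero. Sigma_tau and Sigma_s are unchanged by any baseline, which is why an action-dependent baseline can only help when Sigma_a is a large share of the total.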

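Experimental Results

Research Questions

RQ1: On benchmark tasks, do learned state-action-dependent baselines reduce policy gradient variance more than state-dependent baselines?
RQ2: What are the relative magnitudes of the variance components (Sigma_tau, Sigma_a, Sigma_s) across tasks and estimators?
RQ3: How do implementation details and value function approximation affect the observed benefits of action-dependent baselines?
RQ4: Does a horizon-aware value function yield practical improvements without biasing the gradient estimate?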

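Key Findings

On the tasks examined, learned state-action-dependent baselines do not significantly reduce variance compared with learned state-dependent baselines.
The variance removed by a state-action-dependent baseline is in many cases dominated by the variance due to the value function approximator and the baseline itself.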
Some reported gains from state-action-dependent baselines arise from implementation choices that introduce bias, not unbiased variance reduction.
Function approximation gaps for V(s) and phi(s,a) contribute more to variance than action dependence of the baseline in typical benchmarks.
A horizon-aware value function parameterization yields performance improvements over the typical value function parameterization in experiments.
Improving value function approximation is a more promising path for variance reduction than adopting action-dependent baselines under current methods.
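As an illustration of what "horizon-aware" can mean in a finite-horizon task, the sketch below assumes a value estimate built from two hypothetical outputs: a per-step reward estimate `r_hat`, scaled by the discounted number of remaining steps, plus a residual `v_hat`. This review does not state the exact parameterization, so the form, the function names, and the arguments are all assumptions and should be checked against the paper.

```python
import numpy as np


def discount_weight(t, horizon, gamma):
    """Sum of gamma**(i - t) for i = t..horizon (discounted remaining steps)."""
    steps = horizon - t + 1
    if np.isclose(gamma, 1.0):
        return float(steps)  # undiscounted case: plain count of remaining steps
    return (1.0 - gamma**steps) / (1.0 - gamma)


def horizon_aware_value(r_hat, v_hat, t, horizon, gamma):
    """Assumed form: remaining-horizon weight times r_hat, plus residual v_hat."""
    return discount_weight(t, horizon, gamma) * r_hat + v_hat


# The same state estimates map to different values early and late in an episode.
value_early = horizon_aware_value(r_hat=1.0, v_hat=0.0, t=0, horizon=2, gamma=0.5)
value_late = horizon_aware_value(r_hat=1.0, v_hat=0.0, t=2, horizon=2, gamma=0.5)
print(value_early, value_late)
```

The design point is that a value function conditioned only on the state cannot tell how many steps remain, whereas the explicit horizon weight lets the prediction shrink as the episode end approaches.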

This review was generated by AI and checked by a human editor.