QUICK REVIEW

[Paper Review] True Online Temporal-Difference Learning

Harm van Seijen, A. Rupam Mahmood|arXiv (Cornell University)|Dec 13, 2015

Reinforcement Learning in Robotics | References: 18 | Citations: 56

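One-line summary

This paper presents and evaluates true online temporal-difference learning: true online TD(λ) modifies the standard TD(λ) update in two key ways so that it stays exactly equivalent to the forward view of TD(λ) at every time step. Experiments on random Markov reward processes, a myoelectric prosthetic arm, and an Atari environment indicate that true online TD(λ) and true online Sarsa(λ) learn as fast as or faster than the conventional methods, with no loss in performance, and remove the need to choose between accumulating and replacing traces.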

ABSTRACT

The temporal-difference methods TD($λ$) and Sarsa($λ$) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, new versions of these methods were introduced, called true online TD($λ$) and true online Sarsa($λ$), respectively (van Seijen & Sutton, 2014). These new versions maintain an exact equivalence with the forward view at all times, whereas the traditional versions only approximate it for small step-sizes. We hypothesize that these true online methods not only have better theoretical properties, but also dominate the regular methods empirically. In this article, we put this hypothesis to the test by performing an extensive empirical comparison. Specifically, we compare the performance of true online TD($λ$)/Sarsa($λ$) with regular TD($λ$)/Sarsa($λ$) on random MRPs, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment. We use linear function approximation with tabular, binary, and non-binary features. Our results suggest that the true online methods indeed dominate the regular methods. Across all domains/representations the learning speed of the true online methods are often better, but never worse than that of the regular methods. An additional advantage is that no choice between traces has to be made for the true online methods. Besides the empirical results, we provide an in-depth analysis of the theory behind true online temporal-difference learning. In addition, we show that new true online temporal-difference methods can be derived by making changes to the online forward view and then rewriting the update equations.

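Motivation and objectives

Address the theoretical and empirical limitations of standard TD(λ) and Sarsa(λ), in particular that they only approximate the forward view, and only for small step-sizes.
Develop and analyze methods that stay exactly equivalent to the forward view at every time step, so that λ controls the bias-variance trade-off as the forward view intends.
Evaluate empirically how far the improved theoretical properties of true online TD(λ) translate into better performance across diverse domains and function approximation settings.
Demonstrate that, as the authors claim, the choice between accumulating and replacing traces can be eliminated.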
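
For reference, the forward view mentioned above is usually written in terms of the λ-return. The block below is a sketch in standard textbook notation; the symbols $G_t^{(n)}$, $R_{t+k}$, and the weight vector $\bm{\theta}$ are assumptions and may not match the paper's exact notation.

```latex
% n-step return under linear function approximation (assumed notation)
G_t^{(n)} = \sum_{k=1}^{n} \gamma^{k-1} R_{t+k} + \gamma^{n}\,\bm{\theta}^\top \bm{\phi}_{t+n}

% lambda-return: geometrically weighted average of all n-step returns
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}
```

With $\lambda = 0$ the target reduces to the one-step TD target, and as $\lambda \to 1$ it approaches the full Monte Carlo return, which is the bias-variance trade-off that λ is meant to control.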

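Proposed method

Introduce a new online forward view based on a truncated version of the λ-return whose horizon grows step by step with time, making online updates possible.
Derive the true online TD(λ) update equations directly from this online forward view, guaranteeing exact equivalence at every step.
Modify the standard TD(λ) update by adding a correction term based on the difference between the value estimates of the current feature vector under the current and previous weight vectors, while still using eligibility traces.
Maintain the eligibility trace with the recursive update $\mathbf{e}_t = \gamma\lambda\mathbf{e}_{t-1} + \bm{\phi}_t - \alpha\gamma\lambda(\mathbf{e}_{t-1}^\top\bm{\phi}_t)\bm{\phi}_t$, which makes the exact computation possible online.
Apply the same derivation framework to obtain true online Sarsa(λ) for control tasks, preserving the equivalence with the forward view in on-policy learning.
Use linear function approximation with tabular, binary, and non-binary features to assess how well the results generalize across representation types.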
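
The steps above can be sketched as a single update function. This is a minimal illustration, assuming linear function approximation and NumPy arrays; the function name `true_online_td_step` and its argument names are hypothetical, and the weight update is written from the commonly published form of true online TD(λ) rather than copied from the paper's pseudocode.

```python
import numpy as np

def true_online_td_step(theta, theta_prev, e, phi, phi_next,
                        reward, alpha, gamma, lam):
    """One true online TD(lambda) update (sketch, hypothetical names).

    theta, theta_prev: current and previous weight vectors
    e: eligibility trace from the previous step
    phi, phi_next: feature vectors of the current and next state
    Returns (new_theta, new_theta_prev, new_e).
    """
    # TD error under the current weights
    delta = reward + gamma * (theta @ phi_next) - (theta @ phi)
    # Trace update from the summary above (often called a dutch trace)
    e = gamma * lam * e + phi - alpha * gamma * lam * (e @ phi) * phi
    # Correction term: difference between the value estimates of phi
    # under the current and the previous weight vector
    correction = (theta @ phi) - (theta_prev @ phi)
    new_theta = theta + alpha * delta * e + alpha * correction * (e - phi)
    return new_theta, theta, e

# Usage: one transition with two tabular (one-hot) features
theta = np.zeros(2)
theta, theta_prev, e = true_online_td_step(
    theta, theta.copy(), np.zeros(2),
    phi=np.array([1.0, 0.0]), phi_next=np.array([0.0, 1.0]),
    reward=1.0, alpha=0.5, gamma=0.9, lam=0.8)
```

On the first step of an episode the trace equals the feature vector, so the correction term vanishes and the update coincides with an ordinary TD(0) step; the two methods only diverge once the trace carries information from earlier states.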

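Experimental results

Research questions

RQ1: Does true online TD(λ) learn faster than standard TD(λ) across diverse environments and function approximation schemes?
RQ2: Does true online TD(λ) stay exactly equivalent to the forward view at every time step, even for step-sizes that are not infinitesimally small?
RQ3: Can the choice between accumulating and replacing traces be eliminated, as the authors claim?
RQ4: In control tasks, does true online Sarsa(λ) outperform standard Sarsa(λ) in learning speed and performance?
RQ5: Can the proposed online forward view framework be generalized to derive other true online temporal-difference algorithms?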

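Key findings

Across the tested domains (random MRPs, the myoelectric prosthetic arm, and the Atari environment), true online TD(λ) learned at least as fast as standard TD(λ), and often faster.
In every tested environment and representation type (tabular, binary, and non-binary features), true online TD(λ) never fell below standard TD(λ) in learning speed.
Whereas standard TD(λ) only approximates the forward view for small step-sizes, true online TD(λ) is exactly equivalent to it at every time step, including at moderate step-sizes.
In the control setting (the Arcade Learning Environment domain), true online Sarsa(λ) performed at least as well as standard Sarsa(λ) with either accumulating or replacing traces.
Because it is derived from the online forward view, the algorithm removes the need to choose between accumulating and replacing traces.
On random MRPs with different settings (number of states k=10 and k=100, branching factor b=3 and b=10, reward noise σ=0.1 and σ=0), the advantage of the true online methods was observed consistently across levels of noise and complexity.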
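
To make the (k, b, σ) settings concrete, the following sketch builds a random MRP with k states, b successor states per state, and Gaussian reward noise with standard deviation σ. The construction details (how successors, transition probabilities, and mean rewards are drawn) are assumptions made for illustration, and `make_random_mrp` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def make_random_mrp(k, b, sigma, seed=0):
    """Build a random MRP (illustrative construction, not the paper's code).

    k: number of states, b: successors per state, sigma: reward noise std.
    Returns the transition matrix P and a step(s) -> (s_next, reward) function.
    """
    rng = np.random.default_rng(seed)
    P = np.zeros((k, k))
    for s in range(k):
        # Assumption: b distinct successors chosen uniformly at random
        successors = rng.choice(k, size=b, replace=False)
        # Assumption: probabilities from b-1 random cut points on [0, 1]
        cuts = np.sort(rng.uniform(size=b - 1))
        P[s, successors] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
    # Assumption: each transition has a fixed mean reward drawn from N(0, 1)
    mean_reward = rng.normal(size=(k, k))

    def step(s):
        s_next = int(rng.choice(k, p=P[s]))
        reward = mean_reward[s, s_next] + sigma * rng.normal()
        return s_next, reward

    return P, step

# Usage: the small, noisy setting (k=10, b=3, sigma=0.1)
P, step = make_random_mrp(k=10, b=3, sigma=0.1)
next_state, reward = step(0)
```

Setting `sigma=0` makes the reward of each transition deterministic, which corresponds to the noise-free configuration listed above.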

This review was generated by AI and checked by a human editor.