QUICK REVIEW

[Paper Review] Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples

Tengyu Xu, Shaofeng Zou|arXiv (Cornell University)|Sep 26, 2019

Advanced Bandit Algorithms Research | Cited by 44

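One-line summary

This paper provides the first non-asymptotic convergence analysis of two time-scale TDC under non-i.i.d. Markovian samples, derives convergence rates for diminishing and constant stepsizes, and proposes a blockwise diminishing stepsize scheme.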

ABSTRACT

Gradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios. Among them, the two time-scale TD with gradient correction (TDC) algorithm has been shown to have superior performance. In contrast to previous studies that characterized the non-asymptotic convergence rate of TDC only under identical and independently distributed (i.i.d.) data samples, we provide the first non-asymptotic convergence analysis for two time-scale TDC under a non-i.i.d. Markovian sample path and linear function approximation. We show that the two time-scale TDC can converge as fast as O(log t / t^{2/3}) under diminishing stepsize, and can converge exponentially fast under constant stepsize, but at the cost of a non-vanishing error. We further propose a TDC algorithm with blockwisely diminishing stepsize, and show that it asymptotically converges with an arbitrarily small error at a blockwisely linear convergence rate. Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize.

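Motivation and objectives

Motivate off-policy value function evaluation with gradient-based TD and linear function approximation.
Characterize the non-asymptotic convergence of two time-scale TDC under Markovian data and diminishing stepsizes.
Examine the constant-stepsize behavior and the resulting dynamics of the training and tracking errors.
Propose a blockwise diminishing stepsize scheme that achieves fast convergence with an arbitrarily small training error.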

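Proposed method

Formulate the MSPBE objective for off-policy evaluation with importance weighting.
Define projected two time-scale stochastic approximation updates for θ (slow) and w (fast).
Derive a non-asymptotic bound under diminishing stepsizes, with a rate of up to O(log t / t^{2/3}).
Derive a non-asymptotic bound for constant stepsizes showing convergence to a neighborhood of θ*, with the bias and tracking error terms made explicit.
Introduce a blockwise diminishing stepsize (Algorithm 1) and prove that it reaches arbitrary accuracy at a blockwise linear convergence rate.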
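The projected two time-scale update for θ and w can be sketched roughly as follows. This is a minimal illustration of a commonly used form of the TDC update with an importance weight, not a transcription of the paper's algorithm. The function names `project_l2` and `tdc_step`, the ℓ2-ball projection, the default radii, and the default discount factor are all assumptions made for this example.

```python
import numpy as np


def project_l2(x, radius):
    """Project x onto the l2 ball of the given radius (assumed projection set)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)


def tdc_step(theta, w, phi, phi_next, reward, rho, alpha, beta,
             gamma=0.9, radius_theta=10.0, radius_w=10.0):
    """One projected TDC step on a single transition (illustrative sketch).

    theta     : slow iterate (value-function weights)
    w         : fast iterate (gradient-correction weights)
    phi       : feature vector of the current state
    phi_next  : feature vector of the next state
    rho       : importance weight pi(a|s) / pi_b(a|s)
    alpha     : slow stepsize, beta: fast stepsize
    """
    # TD error under linear function approximation.
    delta = reward + gamma * (phi_next @ theta) - (phi @ theta)

    # Slow time-scale: TD step plus the gradient-correction term.
    theta_new = theta + alpha * rho * (delta * phi - gamma * (phi @ w) * phi_next)

    # Fast time-scale: w tracks a least-squares target that depends on theta.
    # Both updates use the old theta and w.
    w_new = w + beta * (rho * delta * phi - (phi @ w) * phi)

    return project_l2(theta_new, radius_theta), project_l2(w_new, radius_w)


# Hypothetical single transition with 2-dimensional one-hot features.
theta, w = tdc_step(
    theta=np.zeros(2), w=np.zeros(2),
    phi=np.array([1.0, 0.0]), phi_next=np.array([0.0, 1.0]),
    reward=1.0, rho=2.0, alpha=0.1, beta=0.5,
)
```

Passing `alpha` and `beta` in as arguments keeps the update independent of the stepsize schedule, so the same step can be driven by diminishing, constant, or blockwise schedules.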

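Experimental results

Research questions

RQ1: What is the non-asymptotic convergence rate of two time-scale TDC with diminishing stepsizes under non-i.i.d. Markovian samples?
RQ2: How does a constant stepsize affect the training and tracking errors of two time-scale TDC?
RQ3: Can the blockwise diminishing stepsize scheme achieve fast convergence with a small training error?
RQ4: How does the tracking error affect the slow time-scale training error in two time-scale TD learning?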

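Key findings

Under diminishing stepsizes, θ_t converges at a rate of O((log t)/t^{2/3}), attained when σ = 3ν/2 = 1, i.e. σ = 1 and ν = 2/3.
With a constant stepsize, θ_t converges exponentially fast to a neighborhood of θ*, whose size is determined by the bias and tracking error terms.
The tracking error z_t = w_t − ψ(θ_t) decays at a different rate from θ_t because the condition numbers differ.
The blockwise diminishing stepsize reaches arbitrary accuracy at a blockwise linear convergence rate and has a slightly better sample complexity than the standard diminishing stepsize.
In experiments, the blockwise diminishing stepsize converges as fast as the constant stepsize while retaining accuracy comparable to the diminishing stepsize.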
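To make the stepsize regimes above concrete, the sketch below contrasts a polynomially diminishing schedule with a blockwise one. It assumes that σ and ν are the decay exponents of the slow and fast stepsizes, in the form α_t = c_α/(1+t)^σ and β_t = c_β/(1+t)^ν. That form, the constants, the fixed block length, and the halving factor are illustrative assumptions. They are not the values or the exact rule of the paper's Algorithm 1, which this summary does not spell out.

```python
def diminishing_stepsizes(t, c_alpha=1.0, c_beta=1.0, sigma=1.0, nu=2.0 / 3.0):
    """Polynomially decaying stepsizes (assumed form).

    Defaults use sigma = 1 and nu = 2/3, i.e. sigma = 3 * nu / 2 = 1.
    """
    alpha = c_alpha / (1 + t) ** sigma  # slow time-scale (theta)
    beta = c_beta / (1 + t) ** nu       # fast time-scale (w)
    return alpha, beta


def blockwise_stepsizes(t, block_length=1000, alpha0=0.1, beta0=0.5, decay=0.5):
    """Stepsizes held constant inside a block and shrunk between blocks.

    block_length, alpha0, beta0, and decay are hypothetical values chosen
    only for illustration.
    """
    block = t // block_length  # index of the block that iteration t falls in
    factor = decay ** block
    return alpha0 * factor, beta0 * factor


# Compare the two schedules at a few iteration counts.
for t in (0, 999, 1000, 2500):
    print(t, diminishing_stepsizes(t), blockwise_stepsizes(t))
```

Within a block the schedule behaves like a constant stepsize, and the shrink between blocks plays the role of the diminishing stepsize. This matches the trade-off described in the findings: constant-stepsize speed with diminishing-stepsize accuracy.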

This review was generated by AI and checked by a human editor.