QUICK REVIEW

[Paper Review] Finite-Time Analysis of Distributed TD(0) with Linear Function Approximation for Multi-Agent Reinforcement Learning

Thinh T. Doan, Siva Theja Maguluri|arXiv (Cornell University)|Feb 20, 2019

Distributed Control Multi-Agent Systems | Citations: 50

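One-line summary

This paper analyzes the distributed TD(0) algorithm with linear function approximation in a multi-agent setting and proves finite-time convergence rates over time-varying communication graphs. It derives explicit bounds that depend on the network topology and the discount factor.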

ABSTRACT

We study the policy evaluation problem in multi-agent reinforcement learning. In this problem, a group of agents works cooperatively to evaluate the value function for the global discounted accumulative reward problem, which is composed of local rewards observed by the agents. Over a series of time steps, the agents act, get rewarded, update their local estimate of the value function, then communicate with their neighbors. The local update at each agent can be interpreted as a distributed consensus-based variant of the popular temporal difference learning algorithm TD(0). While distributed reinforcement learning algorithms have been presented in the literature, almost nothing is known about their convergence rate. Our main contribution is providing a finite-time analysis for the convergence of the distributed TD(0) algorithm. We do this when the communication network between the agents is time-varying in general. We obtain an explicit upper bound on the rate of convergence of this algorithm as a function of the network topology and the discount factor. Our results mirror what we would expect from using distributed stochastic gradient descent for solving convex optimization problems.

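Research motivation and objectives

Motivates policy evaluation in the multi-agent reinforcement learning (MARL) setting, where agents observe local rewards and cooperate to estimate a global value function.
Proposes a distributed, consensus-based TD(0) algorithm that uses linear function approximation and local updates.
Provides finite-time convergence rates for distributed TD(0) under time-varying communication graphs.
Relates the convergence rate to the network topology, the discount factor, and the choice of step size.
Lays the groundwork for understanding how distributed TD(0) scales similarly to distributed SGD in convex optimization.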

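Proposed method

Formulates a distributed TD(0) update with a consensus step in which each agent averages its neighbors' estimates.
Incorporates the TD(0) direction under linear function approximation: d_v(k) = r_v(k) + gamma * tilde J(s'(k), theta_v) - tilde J(s(k), theta_v).
Applies a projection onto a convex set X to keep the estimates bounded.
Sets up the analysis in matrix form using the weight matrices W(k), under assumptions of connectivity and doubly stochastic weights.
Derives finite-time bounds showing O(1/k) convergence with a constant step size and O(1/√k) convergence with a diminishing step size, with analogous results for the theta estimates.
Provides two main theorems giving convergence rates for each agent's approximate value function and for the averaged parameter vector toward theta*.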
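The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes tilde J(s, theta) = phi(s) · theta, takes the convex set X to be a Euclidean ball, and uses a fixed ring graph in place of time-varying W(k). It also assumes the consensus average and the local TD step are combined before projection. The names project_ball, ring_weights, distributed_td0_step, and demo are hypothetical, as are the random Markov chain, features, and rewards.

```python
import numpy as np


def project_ball(theta, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius} (an assumed choice of X)."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)


def ring_weights(n):
    """Doubly stochastic weights on a ring: 1/3 on self and on each neighbor."""
    eye = np.eye(n)
    shift = np.roll(eye, 1, axis=1)  # cyclic permutation matrix
    return (eye + shift + shift.T) / 3.0


def distributed_td0_step(thetas, W, phi_s, phi_next, rewards, gamma, alpha, radius):
    """One synchronous round for all agents.

    thetas: (N, d) local parameters, W: (N, N) doubly stochastic weights,
    phi_s / phi_next: (d,) features of s(k) and s'(k), rewards: (N,) local rewards.
    """
    # TD direction d_v(k) = r_v(k) + gamma * J(s', theta_v) - J(s, theta_v)
    td = rewards + gamma * (thetas @ phi_next) - thetas @ phi_s
    mixed = W @ thetas  # consensus step: weighted average of neighbors' estimates
    updated = mixed + alpha * td[:, None] * phi_s[None, :]
    return np.vstack([project_ball(t, radius) for t in updated])


def demo(num_agents=4, num_states=5, dim=3, steps=2000, alpha0=0.1, seed=0):
    """Run the sketch on a small random Markov chain (all quantities are made up)."""
    rng = np.random.default_rng(seed)
    P = rng.random((num_states, num_states))
    P /= P.sum(axis=1, keepdims=True)               # transition matrix under a fixed policy
    Phi = rng.standard_normal((num_states, dim))    # one feature row per state
    local_r = rng.random((num_agents, num_states))  # agent v's reward in each state
    W = ring_weights(num_agents)
    thetas = np.zeros((num_agents, dim))
    s = 0
    for k in range(steps):
        s_next = rng.choice(num_states, p=P[s])
        alpha = alpha0 / np.sqrt(k + 1)             # diminishing step size
        thetas = distributed_td0_step(
            thetas, W, Phi[s], Phi[s_next], local_r[:, s], 0.9, alpha, 10.0
        )
        s = s_next
    return thetas


if __name__ == "__main__":
    final = demo()
    print("max disagreement between agents:", np.abs(final - final.mean(axis=0)).max())
```

Replacing the step size with a constant alpha0 gives the constant-step-size variant; printing the disagreement between agents is one simple way to watch the consensus step at work.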

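Results

Research questions

RQ1: What finite-time convergence guarantees can be established in MARL for the distributed TD(0) algorithm with linear function approximation?
RQ2: How do the network topology and the discount factor affect the convergence speed of distributed TD(0)?
RQ3: Can distributed TD(0) achieve rates comparable to distributed stochastic gradient descent in convex optimization?
RQ4: What role does the step-size schedule play in achieving the best finite-time performance in this setting?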

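Key findings

With a constant step size, the distributed TD(0) algorithm converges to a neighborhood of the optimal value.
With a diminishing step size of 1/√k, the value-function estimate converges at rate O(1/√k).
The convergence rate depends explicitly on the discount factor through (1−gamma) and on the network through the spectral gap (1−delta), which is tied to connectivity.
The estimates theta_v converge to theta* at a rate that depends on the smallest eigenvalue sigma_min and the condition number of A.
The results agree with the intuition from distributed SGD on convex problems and extend finite-time analysis to consensus-based TD learning in MARL.
Under particular step-size regimes, the averaged parameter vector converges either exponentially or sublinearly, depending on the convergence factor.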
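The dependence on (1−delta) can be made concrete for a fixed graph. The sketch below assumes delta is the second-largest singular value of a doubly stochastic weight matrix W, a common convention in consensus analysis; the paper's definition for time-varying graphs may differ. The helper names ring_weights and second_singular_value are hypothetical.

```python
import numpy as np


def ring_weights(n):
    """Doubly stochastic weights on a ring: 1/3 on self and on each neighbor."""
    eye = np.eye(n)
    shift = np.roll(eye, 1, axis=1)  # cyclic permutation matrix
    return (eye + shift + shift.T) / 3.0


def second_singular_value(W):
    """Second-largest singular value of W (assumed here to play the role of delta)."""
    singular_values = np.linalg.svd(W, compute_uv=False)  # sorted in descending order
    return singular_values[1]


if __name__ == "__main__":
    for n in (4, 8, 16):
        delta = second_singular_value(ring_weights(n))
        print(f"ring of {n} agents: delta = {delta:.4f}, gap 1 - delta = {1 - delta:.4f}")
```

On a ring, the gap shrinks as agents are added, which illustrates why bounds that scale with 1/(1−delta) become looser on sparsely connected networks.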


This review was generated by AI and checked by a human editor.