QUICK REVIEW

[Paper Review] A Game Theoretic Framework for Model Based Reinforcement Learning

Aravind Rajeswaran, Igor Mordatch|arXiv (Cornell University)|Apr 16, 2020

Reinforcement Learning in Robotics | References: 67 | Citations: 43

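One-Sentence Summary

This paper casts model-based reinforcement learning as a two-player game between a policy player and a model player, and solves it with Stackelberg-game-based algorithms (PAL and MAL). These algorithms achieve high sample efficiency and scale to high-dimensional tasks.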

ABSTRACT

Model-based reinforcement learning (MBRL) has recently gained immense interest due to its potential for sample efficiency and ability to incorporate off-policy data. However, designing stable and efficient MBRL algorithms using rich function approximators have remained challenging. To help expose the practical challenges in MBRL and simplify algorithm design from the lens of abstraction, we develop a new framework that casts MBRL as a game between: (1) a policy player, which attempts to maximize rewards under the learned model; (2) a model player, which attempts to fit the real-world data collected by the policy player. For algorithm development, we construct a Stackelberg game between the two players, and show that it can be solved with approximate bi-level optimization. This gives rise to two natural families of algorithms for MBRL based on which player is chosen as the leader in the Stackelberg game. Together, they encapsulate, unify, and generalize many previous MBRL algorithms. Furthermore, our framework is consistent with and provides a clear basis for heuristics known to be important in practice from prior works. Finally, through experiments we validate that our proposed algorithms are highly sample efficient, match the asymptotic performance of model-free policy gradient, and scale gracefully to high-dimensional tasks like dexterous hand manipulation. Additional details and code can be obtained from the project page at https://sites.google.com/view/mbrl-game

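Motivation and Objectives

Expose the practical challenges of model-based RL and unify algorithm design through abstraction.
Frame MBRL as a two-player game between policy optimization and world-model fitting.
Develop Stackelberg-based algorithms for computing equilibria in continuous games.
Demonstrate improved sample efficiency and scalability to high-dimensional tasks.
Provide insights that connect and generalize prior MBRL approaches.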

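Proposed Method

Model-based RL is formulated as a two-player game. The policy player maximizes reward inside the learned model, while the model player minimizes prediction error under the state distribution induced by the policy.
A Stackelberg game structure is adopted to enable stable bi-level optimization, from which practical gradient-based updates are derived.
Two leader/follower variants are introduced: Policy as Leader (PAL) and Model as Leader (MAL), each with its own nested optimization scheme.
A first-order approximation is used to solve the bi-level updates, enabling iterative updates that either update the model first and then the policy (PAL) or update the policy first and then the model (MAL).
The policy and the dynamics model are represented by neural networks, with ensembles and entropy regularization employed for robustness.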
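
The PAL and MAL update orders described above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation. It assumes a hypothetical one-step environment with scalar dynamics s' = THETA_TRUE * a, a one-parameter model and a one-parameter policy. It also assumes the follower jumps to its best response while the leader takes a single small gradient step. All names (collect, fit_model, pal, mal) and constants are invented for this example; the paper itself uses neural networks and ensembles.

```python
import random

THETA_TRUE = 2.0   # hypothetical true dynamics: s' = THETA_TRUE * a
TARGET = 1.0       # reward is -(s' - TARGET)^2, so the optimal action is 0.5


def collect(action, rng, n=20, noise=0.1):
    """Roll out the policy in the real environment with exploration noise."""
    actions = [action + noise * rng.gauss(0.0, 1.0) for _ in range(n)]
    return [(a, THETA_TRUE * a) for a in actions]


def fit_model(data):
    """Model best response: closed-form least squares for s' = theta * a."""
    return sum(a * s for a, s in data) / sum(a * a for a, _ in data)


def model_grad(theta, data):
    """Gradient of the mean squared prediction error with respect to theta."""
    return sum(2.0 * a * (theta * a - s) for a, s in data) / len(data)


def policy_grad(action, theta):
    """Gradient of the model-predicted reward -(theta * a - TARGET)^2."""
    return -2.0 * theta * (theta * action - TARGET)


def pal(iters=50, lr=0.05, seed=0):
    """Policy as Leader: refit the model first, then one small policy step."""
    rng = random.Random(seed)
    action, theta = 0.1, 1.0
    for _ in range(iters):
        data = collect(action, rng)                # data from current policy
        theta = fit_model(data)                    # follower: model best response
        action += lr * policy_grad(action, theta)  # leader: conservative ascent step
    return action, theta


def mal(iters=200, lr=0.2, seed=0):
    """Model as Leader: re-solve the policy first, then one small model step."""
    rng = random.Random(seed)
    theta, buffer = 1.0, []
    for _ in range(iters):
        action = TARGET / theta                    # follower: policy best response
        buffer.extend(collect(action, rng))        # aggregate data across iterations
        theta -= lr * model_grad(theta, buffer)    # leader: conservative descent step
    return TARGET / theta, theta


if __name__ == "__main__":
    print("PAL (action, theta):", pal())
    print("MAL (action, theta):", mal())
```

In this noise-free toy setting both loops approach the optimal action TARGET / THETA_TRUE. The point of the sketch is only the ordering of the two updates and which player moves conservatively, not the relative speed of the two variants.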

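Experimental Results

Research Questions

RQ1: Does viewing MBRL as a two-player game yield stable and efficient learning dynamics?
RQ2: Do the Stackelberg-based PAL and MAL algorithms improve sample efficiency and scalability compared with prior MBRL and model-free methods?
RQ3: How do PAL and MAL compare in environments where the dynamics or the goal distribution change?
RQ4: Are there theoretical guarantees linking the quality of the equilibrium to policy optimality in the environment?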

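Key Findings

PAL and MAL show stable, nearly monotonic learning across continuous control tasks.
Both are more sample efficient than prior model-based and model-free methods, and scale to high-dimensional dexterous manipulation tasks.
PAL tends to learn faster than MAL on the tasks studied, while MAL handles changes in the goal distribution better.
BR (Best Response) leads to instability, while GDA (Gradient Descent-Ascent) can be slow or unstable depending on the setting.
On the evaluated tasks, the methods match the asymptotic performance of model-free policy-gradient baselines.
The framework unifies and generalizes prior MBRL approaches by connecting conservatism and data aggregation through a principled game-theoretic perspective.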
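
The remark that GDA can be unstable can be illustrated on a generic toy game. The sketch below is not the MBRL game from the paper: it assumes the standard bilinear objective f(x, y) = x * y, where x descends and y ascends simultaneously. The function name simultaneous_gda and the step size are chosen for illustration only.

```python
import math


def simultaneous_gda(x, y, lr=0.1, steps=100):
    """Simultaneous gradient descent-ascent on f(x, y) = x * y.

    x descends on f (df/dx = y) while y ascends on f (df/dy = x).
    """
    for _ in range(steps):
        x, y = x - lr * y, y + lr * x
    return x, y


start = (1.0, 0.0)
end = simultaneous_gda(*start)

# Distance from the equilibrium (0, 0) before and after the updates.
start_dist = math.hypot(*start)
end_dist = math.hypot(*end)

if __name__ == "__main__":
    print("distance before:", start_dist, "after:", end_dist)
```

Each simultaneous step multiplies the squared distance from the equilibrium by 1 + lr^2, so the iterates spiral outward for any positive step size. This is one simple way to see why plain GDA may need small steps or extra structure to behave well.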


This review was generated by AI and checked by a human editor.