QUICK REVIEW

[Paper Review] DeepMDP: Learning Continuous Latent Space Models for Representation Learning

Carles Gelada, Saurabh Kumar|arXiv (Cornell University)|Jun 6, 2019

Reinforcement Learning in Robotics | References: 58 | Citations: 67

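One-Line Summary

DeepMDP learns a continuous latent space model of an MDP by minimizing two losses, one for reward prediction and one for next-latent-state prediction. This yields theoretical guarantees and, when used as an auxiliary task in RL, improved performance.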

ABSTRACT

Many reinforcement learning (RL) tasks provide the agent with high-dimensional observations that can be simplified into low-dimensional continuous states. To formalize this process, we introduce the concept of a DeepMDP, a parameterized latent space model that is trained via the minimization of two tractable losses: prediction of rewards and prediction of the distribution over next latent states. We show that the optimization of these objectives guarantees (1) the quality of the latent space as a representation of the state space and (2) the quality of the DeepMDP as a model of the environment. We connect these results to prior work in the bisimulation literature, and explore the use of a variety of metrics. Our theoretical findings are substantiated by the experimental result that a trained DeepMDP recovers the latent structure underlying high-dimensional observations on a synthetic environment. Finally, we show that learning a DeepMDP as an auxiliary task in the Atari 2600 domain leads to large performance improvements over model-free RL.

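Motivation and Objectives

Motivate representation learning for RL as the reduction of high-dimensional observations to meaningful continuous latent states.
Propose DeepMDP, a latent space model trained with tractable losses on rewards and next-state distributions.
Provide theoretical guarantees that link latent space learning to the quality of the representation and of the model.
Connect DeepMDPs to bisimulation and examine different probability metrics for the latent transitions.
Demonstrate the practical value of DeepMDP as an auxiliary task that improves model-free RL performance.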

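Proposed Method

Define a DeepMDP as a latent space model that uses an embedding phi from S to S_bar.
Train by minimizing two losses: L_R = |R(s,a) - R_bar(phi(s),a)| and L_P = D(phi P(.|s,a), P_bar(.|phi(s),a)), where D is a distance between distributions over latent states.
Use the Wasserstein metric (and other MMD-based metrics) for the latent transition loss, which makes the theoretical guarantees possible.
Derive global and local bounds on value differences and representation quality in terms of L_R, L_P, and Lipschitz constants.
Establish a connection between DeepMDPs and bisimulation metrics through the Wasserstein distance.
Discuss how the guarantees generalize to Norm-MMD metrics and what they imply for policy learning with deep networks.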
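
The two losses above can be sketched on a batch of sampled transitions. This is a minimal illustration, not the authors' implementation. It assumes a deterministic latent transition model, in which case the Wasserstein distance to a point mass reduces to the expected distance between phi(s') and the predicted latent state. The function name `deepmdp_losses`, the toy dynamics, and the hand-written models are all hypothetical.

```python
import numpy as np

def deepmdp_losses(phi, reward_model, transition_model, s, a, r, s_next):
    """Sample-based estimates of the two DeepMDP losses (illustrative only)."""
    z, z_next = phi(s), phi(s_next)
    # L_R: absolute error between observed and predicted reward.
    reward_loss = float(np.mean(np.abs(r - reward_model(z, a))))
    # L_P: distance between the embedded next state and the predicted
    # latent next state (assumes a deterministic latent transition).
    diff = z_next - transition_model(z, a)
    transition_loss = float(np.mean(np.linalg.norm(diff, axis=-1)))
    return reward_loss, transition_loss

# Toy environment (assumed): 4-D observations, two discrete actions.
rng = np.random.default_rng(0)
shift = np.array([[1.0, 0.0, 0.5, 0.0], [0.0, 2.0, 0.0, 0.5]])
s = rng.normal(size=(32, 4))
a = rng.integers(0, 2, size=32)
s_next = s + shift[a]
r = s[:, 0] + a

phi = lambda obs: obs[:, :2]                      # keep the first two dims
exact_reward = lambda z, act: z[:, 0] + act       # matches the toy reward
exact_transition = lambda z, act: z + shift[act][:, :2]
static_transition = lambda z, act: z              # deliberately wrong model

exact = deepmdp_losses(phi, exact_reward, exact_transition, s, a, r, s_next)
static = deepmdp_losses(phi, exact_reward, static_transition, s, a, r, s_next)
print("exact model losses:", exact)
print("static model losses:", static)
```

With the exact latent models both losses vanish, while the static transition model incurs a transition loss equal to the average size of the latent shift. In practice phi and both models would be neural networks trained by gradient descent on these quantities.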

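Experimental Results

Research Questions

RQ1: Can a parameterized latent space model trained on reward and transition prediction provide both a good representation of the state space and a good model of the environment?
RQ2: How does the choice of probability metric (Wasserstein in particular) affect the resulting guarantees and the relationship to bisimulation?
RQ3: Does the DeepMDP representation recover the latent structure underlying high-dimensional observations?
RQ4: Can DeepMDP, used as an auxiliary task, improve model-free RL, for example on Atari 2600 games?
RQ5: What local (data-efficient) guarantees hold when a DeepMDP is learned from data that covers only part of the state space?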

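Key Findings

DeepMDP provides bounds showing that accurate latent predictions yield accurate value functions in the original MDP.
When the global losses L_R and L_P are zero, the embedding phi is guaranteed to preserve value relationships, up to Lipschitz-dependent terms.
A theoretical relationship is established between the Wasserstein-based DeepMDP losses and bisimulation metrics.
Local DeepMDP losses allow guarantees to hold when only partial state-action data is available.
Empirical results show that DeepMDP recovers the latent structure from high-dimensional observations in a synthetic environment.
Using DeepMDP as an auxiliary task on Atari 2600 yields large performance gains over model-free baselines.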
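
The first finding can be made concrete with the shape of the global value-difference bound. The notation here is an assumption not used elsewhere in this review: L_R^∞ and L_P^∞ denote the global (supremum) reward and transition losses, K_V the Lipschitz constant of the latent value function, and γ the discount factor. For a Lipschitz latent policy, the bound has roughly the following form, which should be checked against the exact statement in the paper:

```latex
\left| Q^{\bar{\pi}}(s,a) - \bar{Q}^{\bar{\pi}}(\phi(s),a) \right|
\;\le\;
\frac{L_{\bar{R}}^{\infty} + \gamma\, K_{\bar{V}}\, L_{\bar{P}}^{\infty}}{1-\gamma}
```

Read this way, driving both losses toward zero shrinks the gap between values in the original MDP and values in the latent model, with the transition loss weighted by how sharply the latent value function can change.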

This review was generated by AI and checked by a human editor.