QUICK REVIEW

[Paper Review] Planning with Goal-Conditioned Policies

Soroush Nasiriany, Vitchyr H. Pong | arXiv (Cornell University) | Nov 19, 2019

Reinforcement Learning in Robotics | Cited by 53

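One-Sentence Summary

LEAP combines model-free goal-conditioned policies with planning over a learned latent state space to solve long-horizon tasks from high-dimensional observations such as images.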

ABSTRACT

Planning methods can solve temporally extended sequential decision making problems by composing simple behaviors. However, planning requires suitable abstractions for the states and transitions, which typically need to be designed by hand. In contrast, model-free reinforcement learning (RL) can acquire behaviors from low-level inputs directly, but often struggles with temporally extended tasks. Can we utilize reinforcement learning to automatically form the abstractions needed for planning, thus obtaining the best of both approaches? We show that goal-conditioned policies learned with RL can be incorporated into planning, so that a planner can focus on which states to reach, rather than how those states are reached. However, with complex state observations such as images, not all inputs represent valid states. We therefore also propose using a latent variable model to compactly represent the set of valid states for the planner, so that the policies provide an abstraction of actions, and the latent variable model provides an abstraction of states. We compare our method with planning-based and model-free methods and find that our method significantly outperforms prior work when evaluated on image-based robot navigation and manipulation tasks that require non-greedy, multi-staged behavior.

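Motivation and Objectives

Motivates combining model-free reinforcement learning with planning to obtain temporal compositionality without detailed modeling of the environment.
Proposes subgoal planning that uses a goal-conditioned value function as an implicit model.
Learns a latent state representation that keeps subgoals on the manifold of valid states.
Shows that planning over latent subgoals with goal-reaching policies outperforms prior model-free and model-based methods on vision-based tasks.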

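Proposed Method

Uses goal-conditioned policies trained with Temporal Difference Models (TDMs) as short-horizon controllers.
Plans over intermediate subgoals in a low-dimensional latent space learned by a VAE.
Defines a feasibility vector for the subgoals from the reachability values V(s, g, t), and selects subgoals by minimizing its norm.
Optimizes subgoals in the latent space with a penalty on low latent likelihood, which keeps them on the manifold of valid states.
Decodes latent subgoals into actual state goals with the VAE decoder and executes them with the goal-conditioned policy.
Handles high-dimensional observations by planning in the latent space instead of on raw pixels, and reuses the VAE encoder for RL.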
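
The subgoal selection described in this section can be illustrated with a toy sketch. It is not the authors' implementation: `toy_value`, `plan_cost`, and `plan_subgoals` are hypothetical names, the learned value function V(s, g, t) is replaced by a negative Euclidean distance in a 2-D latent space, the latent prior is assumed to be a standard normal, the norm is assumed to be L2, and the cross-entropy method is used only as one simple derivative-free optimizer (the review does not say which optimizer the paper uses).

```python
import numpy as np


def toy_value(z_from, z_to):
    """Hypothetical stand-in for a learned V(s, g, t): negative Euclidean
    distance in latent space, so 0 means the goal is already reached."""
    return -np.linalg.norm(z_to - z_from, axis=-1)


def plan_cost(subgoals, z_start, z_goal, prior_weight):
    """Cost of a batch of candidate plans; subgoals has shape (N, K, D)."""
    n, _, dim = subgoals.shape
    start = np.broadcast_to(z_start, (n, 1, dim))
    goal = np.broadcast_to(z_goal, (n, 1, dim))
    path = np.concatenate([start, subgoals, goal], axis=1)  # (N, K + 2, D)
    # Feasibility vector: one value per consecutive pair along the path.
    feasibility = toy_value(path[:, :-1], path[:, 1:])  # (N, K + 1)
    reach_cost = np.linalg.norm(feasibility, axis=1)
    # Negative log-density of a standard normal prior, up to a constant
    # (assumed prior; penalizes subgoals with low latent likelihood).
    prior_cost = 0.5 * np.sum(subgoals ** 2, axis=(1, 2))
    return reach_cost + prior_weight * prior_cost


def plan_subgoals(z_start, z_goal, num_subgoals=3, prior_weight=0.1,
                  iters=50, pop=500, n_elite=50, seed=0):
    """Cross-entropy method over sequences of latent subgoals."""
    rng = np.random.default_rng(seed)
    dim = z_start.shape[0]
    mean = np.zeros((num_subgoals, dim))
    std = np.ones((num_subgoals, dim))
    for _ in range(iters):
        noise = rng.standard_normal((pop, num_subgoals, dim))
        samples = mean + std * noise
        costs = plan_cost(samples, z_start, z_goal, prior_weight)
        elite = samples[np.argsort(costs)[:n_elite]]
        # Refit the sampling distribution to the lowest-cost candidates.
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean


z_start = np.array([-2.0, 0.0])
z_goal = np.array([2.0, 0.0])
print(plan_subgoals(z_start, z_goal))
```

In this toy setting, with `prior_weight=0.0` the minimizer spaces the subgoals evenly on the line between start and goal, and raising `prior_weight` pulls them toward the origin, where the assumed prior places most of its mass. In a real system the planned latent subgoals would then be decoded and handed to the goal-conditioned policy one at a time.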

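Experimental Results

Research Questions

RQ1: Can goal-conditioned policies serve as an abstraction for planning long-horizon tasks?
RQ2: Does planning over a latent representation improve feasibility and performance when goals are high-dimensional (e.g., images)?
RQ3: How does LEAP compare with purely model-free and purely model-based methods on image-based navigation and manipulation tasks?
RQ4: How does reusing a pretrained VAE encoder affect learning efficiency and performance?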

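Key Findings

LEAP outperforms prior model-free and model-based methods on vision-based navigation and manipulation tasks.
Planning over three latent subgoals with a TDM-based policy reaches long-horizon goals faster than short-horizon goal reaching alone.
Optimizing over latent subgoals yields meaningful subgoals that correspond to feasible states, unlike optimizing over raw image pixels.
Reusing the VAE encoder speeds up learning compared with training the RL networks from scratch.
In ablations, planning in the latent space is substantially more effective than planning directly in image space.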

This review was generated by AI and checked by a human editor.