QUICK REVIEW

[Paper Review] Self-Consistent Trajectory Autoencoder: Hierarchical Reinforcement Learning with Trajectory Embeddings

John D. Co-Reyes, YuXuan Liu | arXiv (Cornell University) | Jun 7, 2018

Reinforcement Learning in Robotics | References: 28 | Citations: 67

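One-line summary

SeCTAR learns a continuous latent space of trajectories using a trajectory-level VAE with a state decoder and a latent-conditioned policy decoder, which enables model-based planning in that latent space on long-horizon, sparse-reward tasks.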

ABSTRACT

In this work, we take a representation learning perspective on hierarchical reinforcement learning, where the problem of learning lower layers in a hierarchy is transformed into the problem of learning trajectory-level generative models. We show that we can learn continuous latent representations of trajectories, which are effective in solving temporally extended and multi-stage problems. Our proposed model, SeCTAR, draws inspiration from variational autoencoders, and learns latent representations of trajectories. A key component of this method is to learn both a latent-conditioned policy and a latent-conditioned model which are consistent with each other. Given the same latent, the policy generates a trajectory which should match the trajectory predicted by the model. This model provides a built-in prediction mechanism, by predicting the outcome of closed loop policy behavior. We propose a novel algorithm for performing hierarchical RL with this model, combining model-based planning in the learned latent space with an unsupervised exploration objective. We show that our model is effective at reasoning over long horizons with sparse rewards for several simulated tasks, outperforming standard reinforcement learning methods and prior methods for hierarchical reasoning, model-based planning, and exploration.

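Motivation and objectives

Frame hierarchical RL as a representation learning problem by modeling trajectories.
Propose a continuous latent space of skills that enables temporally extended, reusable behaviors.
Develop a two-headed decoder framework (a state decoder and a policy decoder) that ensures consistency and enables planning.
Integrate model-based planning in the latent space with an unsupervised exploration objective to handle sparse rewards.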

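Proposed method

Extend the variational autoencoder framework to trajectories with a trajectory encoder q_phi(z|tau).
Use a state decoder p_theta_SD(tau|z) that generates a trajectory from the latent representation z.
Introduce a policy decoder p_theta_PD(a|s,z) that is executed in the environment to realize the latent trajectory.
Enforce consistency between the two decoders by maximizing the ELBO while minimizing KL(p_theta_PD(tau|z) || p_theta_SD(tau|z)).
Train the encoder and decoders using recurrent networks for state trajectories and a feedforward network for the policy decoder.
Plan in the latent space with model predictive control, using the state decoder as a predictive model of closed-loop behavior.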
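
The two training signals listed above can be written out as follows. This is a sketch reconstructed from the summary's notation rather than a quotation of the paper's equations: the prior p(z) is an assumption (a fixed prior such as a standard Gaussian, as is usual for VAEs), and p_theta_PD(tau|z) is taken to mean the distribution over state trajectories induced by running the policy decoder in the environment.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{ELBO}}(\phi, \theta_{SD})
  &= \mathbb{E}_{q_\phi(z \mid \tau)}\big[\log p_{\theta_{SD}}(\tau \mid z)\big]
     - D_{\mathrm{KL}}\big(q_\phi(z \mid \tau) \,\|\, p(z)\big) \\
\mathcal{L}_{\mathrm{cons}}(\theta_{PD})
  &= D_{\mathrm{KL}}\big(p_{\theta_{PD}}(\tau \mid z) \,\|\, p_{\theta_{SD}}(\tau \mid z)\big) \\
  &= -\,\mathbb{E}_{\tau \sim p_{\theta_{PD}}(\cdot \mid z)}\big[\log p_{\theta_{SD}}(\tau \mid z)\big]
     - \mathcal{H}\big(p_{\theta_{PD}}(\tau \mid z)\big)
\end{aligned}
```

The first quantity is maximized and the second is minimized. The last line is the standard expansion of the KL divergence: the policy is pushed toward trajectories that the state decoder assigns high likelihood, while an entropy term keeps its trajectory distribution from collapsing.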

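Experimental results

Research questions

RQ1: Can a continuous latent space of trajectories be learned without manually specified subgoals or discrete skills?
RQ2: Does joint training of a trajectory-level VAE and a latent-conditioned policy make long-horizon planning reliable?
RQ3: Does combining model-based planning in the latent space with an entropy-based exploration objective improve performance on sparse-reward tasks?
RQ4: Does the state decoder provide meaningful outcome predictions for high-level latent actions?
RQ5: How does SeCTAR compare with prior model-free, model-based, and hierarchical RL methods on long-horizon tasks?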

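Key findings

SeCTAR enables planning over extended trajectories on long-horizon, sparse-reward tasks and outperforms several baselines.
The latent-space MPC planner uses the state decoder as a trajectory predictor and selects the latent actions that maximize reward.
Joint training yields a consistent state decoder and policy decoder, which improves closed-loop planning and exploration.
Unsupervised exploration based on the entropy of the marginal trajectory distribution improves state-space coverage and exploration quality.
Interpolation in the latent space produces coherent trajectories, indicating a meaningful, generalizable latent representation.
SeCTAR achieves higher performance and sample efficiency than TRPO, A3C, VIME, FeUdal Networks, and option-critic on the tested tasks.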
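
The latent-space MPC loop described above can be illustrated with a minimal sketch. Everything in it is a stand-in chosen for brevity, not the paper's implementation: `toy_state_decoder` is a hypothetical hand-written function in place of the learned recurrent state decoder, the reward is a dense negative distance rather than a sparse task reward, and random shooting is assumed as the optimizer over latent sequences.

```python
import random


def toy_state_decoder(z, start, steps=5, step_size=0.2):
    """Hypothetical stand-in for the learned state decoder p_SD(tau | z).

    Treats the 2-D latent z as a constant velocity and predicts `steps`
    future states from `start`. A real decoder would be a trained network.
    """
    x, y = start
    trajectory = []
    for _ in range(steps):
        x, y = x + step_size * z[0], y + step_size * z[1]
        trajectory.append((x, y))
    return trajectory


def reward(state, goal):
    """Toy dense reward (an assumption): negative distance to the goal."""
    return -((state[0] - goal[0]) ** 2 + (state[1] - goal[1]) ** 2) ** 0.5


def plan_latent(start, goal, rng, n_candidates=256, n_latents=3):
    """Random-shooting planner over sequences of latents.

    Each candidate is a sequence of latents; the state decoder predicts the
    segment produced by each latent, and segments are chained end to end.
    Returns the first latent of the best-scoring sequence.
    """
    best_score, best_plan = float("-inf"), None
    for _ in range(n_candidates):
        plan = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_latents)]
        state, score = start, 0.0
        for z in plan:
            state = toy_state_decoder(z, state)[-1]
            score += reward(state, goal)
        if score > best_score:
            best_score, best_plan = score, plan
    return best_plan[0]


def run_mpc(start, goal, n_replans=10, seed=0):
    """Replan after every executed latent (model predictive control)."""
    rng = random.Random(seed)
    state = start
    for _ in range(n_replans):
        z = plan_latent(state, goal, rng)
        # In SeCTAR the policy decoder would execute z in the environment;
        # here the toy decoder's own prediction stands in for that rollout.
        state = toy_state_decoder(z, state)[-1]
    return state
```

Calling `run_mpc((0.0, 0.0), (10.0, 5.0))` moves the toy state toward the goal one latent at a time. The point of the sketch is the division of labor: the planner only ever queries the state decoder, so the consistency between the two decoders is what makes its predictions match what the policy decoder would actually do.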


This review was generated by AI and checked by a human editor.