QUICK REVIEW

[Paper Review] Eigenoption Discovery through the Deep Successor Representation

Marlos C. Machado, Clemens Rosenbaum|arXiv (Cornell University)|Oct 30, 2017

Reinforcement Learning in Robotics | References: 26 | Citations: 61

One-Line Summary

This paper extends eigenoption discovery to stochastic environments and uses the successor representation to learn proto-value functions, which makes it possible to discover eigenoptions from raw pixel inputs and improves exploration.

ABSTRACT

Options in reinforcement learning allow agents to hierarchically decompose a task into subtasks, having the potential to speed up learning and planning. However, autonomously learning effective sets of options is still a major challenge in the field. In this paper we focus on the recently introduced idea of using representation learning methods to guide the option discovery process. Specifically, we look at eigenoptions, options obtained from representations that encode diffusive information flow in the environment. We extend the existing algorithms for eigenoption discovery to settings with stochastic transitions and in which handcrafted features are not available. We propose an algorithm that discovers eigenoptions while learning non-linear state representations from raw pixels. It exploits recent successes in the deep reinforcement learning literature and the equivalence between proto-value functions and the successor representation. We use traditional tabular domains to provide intuition about our approach and Atari 2600 games to demonstrate its potential.

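Research Motivation and Objectives

Motivate and address the challenge of automatically discovering effective options (eigenoptions) for hierarchical RL.
Introduce a representation-learning approach that captures the diffusive information flow of stochastic environments.
Use the equivalence between proto-value functions and the successor representation to guide eigenoption discovery.
Develop a neural network architecture that learns a state representation from raw pixel inputs while also learning the SR.
Demonstrate the approach on tabular domains to build intuition and on Atari 2600 games to show feasibility from raw pixels.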
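
For reference, the equivalence mentioned above can be written out under standard definitions. This is a sketch, not the paper's own notation: the symbols $T$ (transition matrix of a uniform random walk), $A$ (adjacency matrix), $D$ (diagonal degree matrix), and $\gamma$ (discount factor) are assumptions introduced here, and the graph is assumed undirected with $T = D^{-1}A$.

```latex
% Successor representation of policy \pi and its matrix form
\Psi_\pi(s, s') = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,
  \mathbb{1}\{S_t = s'\} \;\middle|\; S_0 = s\right],
\qquad
\Psi = (I - \gamma T)^{-1}.

% Proto-value functions are eigenvectors of the normalized graph Laplacian
L = D^{-1/2}(D - A)D^{-1/2} = I - D^{1/2}\, T\, D^{-1/2}.

% If \Psi e = \lambda_{SR}\, e, then T e = \gamma^{-1}(1 - \lambda_{SR}^{-1})\, e, and therefore
L\,(D^{1/2} e) = \lambda_{PVF}\,(D^{1/2} e),
\qquad
\lambda_{PVF} = 1 - \left(1 - \lambda_{SR}^{-1}\right)\gamma^{-1}.
```

Under these assumptions the two eigenvector sets coincide up to the rescaling by $D^{1/2}$, which is why the SR can stand in for the Laplacian when states cannot be enumerated.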

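Proposed Method

Define each eigenoption through an eigenpurpose derived from a representation of the environment's diffusive information flow (DIF).
Use the successor representation (SR) to estimate the DIF model, exploiting its equivalence to proto-value functions.
In the tabular case, learn the SR from samples and extract eigenpurposes from the resulting matrix; use them to define each eigenoption's initiation set, policy, and termination set.
Extend the method to the deep setting by training a neural network that estimates the SR from raw pixel inputs, using a reconstruction auxiliary task and a projection module that produces latent features.
Compute eigenpurposes from the SR outputs (as the right singular vectors of a matrix of SR observations collected under a random policy), and learn options that maximize the corresponding intrinsic rewards.
Evaluate qualitatively on Atari games with one-step lookahead greedy action selection, visualizing meaningful goal-directed behavior.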
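
The tabular steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the 8-state corridor, the `step` dynamics, the hyperparameters, and all function names are hypothetical, and the SR is learned with a plain temporal-difference update from transitions sampled under a uniform random policy.

```python
import numpy as np


def step(state, action, n_states):
    """Hypothetical corridor dynamics: action 0 moves left, action 1 moves right."""
    move = -1 if action == 0 else 1
    # Walls at both ends: moving off the corridor leaves the agent in place.
    return min(max(state + move, 0), n_states - 1)


def learn_sr(n_states, gamma=0.9, lr=0.1, n_steps=20000, seed=0):
    """TD-learn a tabular SR from transitions sampled under a uniform random policy."""
    rng = np.random.default_rng(seed)
    one_hot = np.eye(n_states)
    sr = np.zeros((n_states, n_states))
    state = int(rng.integers(n_states))
    for _ in range(n_steps):
        action = int(rng.integers(2))
        next_state = step(state, action, n_states)
        # TD target for row `state`: visit indicator plus discounted SR of the successor.
        target = one_hot[state] + gamma * sr[next_state]
        sr[state] += lr * (target - sr[state])
        state = next_state
    return sr


def eigenpurposes(sr):
    """Return right singular vectors of the SR matrix as rows, largest singular value first."""
    _, _, vt = np.linalg.svd(sr)
    return vt


def intrinsic_reward(eigenpurpose, state, next_state):
    """Tabular intrinsic reward: change in the eigenpurpose entry along the transition."""
    return eigenpurpose[next_state] - eigenpurpose[state]


N_STATES = 8
GAMMA = 0.9
sr = learn_sr(N_STATES, gamma=GAMMA)
purposes = eigenpurposes(sr)
purpose = purposes[1]  # arbitrary pick for illustration
print(intrinsic_reward(purpose, 2, 3))
```

With one-hot features, the intrinsic reward reduces to a difference of two entries of the eigenpurpose vector. An option policy would then be trained to maximize the discounted sum of these rewards and to terminate where no positive return remains; that training loop is omitted here.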

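Experimental Results

Research Questions

RQ1: Can eigenoptions be discovered in stochastic environments without enumerating states or using handcrafted features?
RQ2: Does learning the SR from raw pixel inputs yield eigenoptions that are useful for exploration and control?
RQ3: How closely do SR-based eigenoptions match PVF-based eigenoptions in the behavior they induce in the agent?
RQ4: Does incorporating the SR-based option discovery pipeline improve exploration and learning on Atari games compared with primitive actions alone?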

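Key Findings

The SR-based approach discovers eigenoptions without a predefined state representation, including in stochastic settings.
Eigenpurposes derived from SR observations produce meaningful target behaviors that improve exploration (shorter diffusion time) in the tabular rooms and Atari experiments.
In tabular domains, eigenoptions learned from the SR approximate those obtained from PVF-based eigenvectors and improve learning when combined with Q-learning.
In the Atari experiments, the deep SR network learns a latent representation from raw pixels, which yields goal-directed eigenoptions that lead the agent to specific screen locations.
Eigenoptions provide denser intrinsic rewards, tend to prioritize exploration, and remain effective even when the SR is estimated from limited samples.
The approach is robust to imperfect SR estimates, indicating tolerance to the quality of the learned representation.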
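
The finding that SR-based and PVF-based eigenvectors agree can be checked in closed form on a toy graph. The sketch below is an illustration under assumptions, not an experiment from the paper: the 5-node path graph, the uniform random walk, and the discount factor are hypothetical choices. It maps each eigenpair of the normalized Laplacian to a candidate SR eigenpair and measures how well the eigenvector equation holds.

```python
import numpy as np

# Hypothetical example graph: an undirected path with 5 nodes.
n = 5
adjacency = np.zeros((n, n))
for i in range(n - 1):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0

degree = adjacency.sum(axis=1)
transition = adjacency / degree[:, None]  # uniform random walk, D^{-1} A
gamma = 0.9

# Closed-form SR of the random walk: (I - gamma * T)^{-1}.
sr = np.linalg.inv(np.eye(n) - gamma * transition)

# Normalized graph Laplacian: I - D^{-1/2} A D^{-1/2}.
d_inv_sqrt = 1.0 / np.sqrt(degree)
laplacian = np.eye(n) - d_inv_sqrt[:, None] * adjacency * d_inv_sqrt[None, :]

# eigh applies because the normalized Laplacian is symmetric.
lap_vals, lap_vecs = np.linalg.eigh(laplacian)

max_residual = 0.0
for lam, vec in zip(lap_vals, lap_vecs.T):
    sr_vec = d_inv_sqrt * vec                     # rescale the PVF by D^{-1/2}
    sr_val = 1.0 / (1.0 - gamma * (1.0 - lam))    # predicted SR eigenvalue
    residual = np.max(np.abs(sr @ sr_vec - sr_val * sr_vec))
    max_residual = max(max_residual, residual)

print(max_residual)
```

A residual near machine precision indicates that every Laplacian eigenvector, rescaled by the degree matrix, is also an eigenvector of the SR. An SR learned from samples would only satisfy this approximately.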

This review was generated by AI and checked by a human editor.