QUICK REVIEW

[Paper Review] Data-Efficient Hierarchical Reinforcement Learning

Ofir Nachum, Shixiang Gu|arXiv (Cornell University)|May 21, 2018

Reinforcement Learning in Robotics|References: 3|Citations: 265

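One-line summary

The paper proposes HIRO, a two-level hierarchical reinforcement learning agent trained off-policy with an off-policy correction, which achieves high sample efficiency and strong performance on locomotion and object-interaction tasks.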

ABSTRACT

Hierarchical reinforcement learning (HRL) is a promising approach to extend traditional reinforcement learning (RL) methods to solve more complex tasks. Yet, the majority of current HRL methods require careful task-specific design and on-policy training, making them difficult to apply in real-world scenarios. In this paper, we study how we can develop HRL algorithms that are general, in that they do not make onerous additional assumptions beyond standard RL algorithms, and efficient, in the sense that they can be used with modest numbers of interaction samples, making them suitable for real-world problems such as robotic control. For generality, we develop a scheme where lower-level controllers are supervised with goals that are learned and proposed automatically by the higher-level controllers. To address efficiency, we propose to use off-policy experience for both higher and lower-level training. This poses a considerable challenge, since changes to the lower-level behaviors change the action space for the higher-level policy, and we introduce an off-policy correction to remedy this challenge. This allows us to take advantage of recent advances in off-policy model-free RL to learn both higher- and lower-level policies using substantially fewer environment interactions than on-policy algorithms. We term the resulting HRL agent HIRO and find that it is generally applicable and highly sample-efficient. Our experiments show that HIRO can be used to learn highly complex behaviors for simulated robots, such as pushing objects and utilizing them to reach target locations, learning from only a few million samples, equivalent to a few days of real-time interaction. In comparisons with a number of prior HRL methods, we find that our approach substantially outperforms previous state-of-the-art techniques.

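Motivation and objectives

Motivate and develop a general, data-efficient HRL approach that works with standard RL components.
Train lower-level policies guided by goals that the higher-level controller proposes automatically.
Enable off-policy training at both levels of the hierarchy to improve sample efficiency.
Introduce an off-policy correction to handle the non-stationarity caused by changes in the lower level.
Demonstrate strong performance on challenging simulated robot tasks with limited interaction data.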

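Proposed method

A two-level hierarchy with a higher-level policy (goals) and a lower-level policy (actions).
The lower level receives a goal g_t and is trained with the intrinsic reward r = -||s_t + g_t - s_{t+1}||_2; the higher level operates on a coarser time scale, proposing a new goal every c steps.
For the off-policy correction, past higher-level goals are relabeled to maximize the probability that the observed lower-level actions would occur under the current lower-level controller, which makes off-policy training of the higher level possible.
Both policies are trained with an off-policy TD method (TD3) using replay buffers.
Goals are defined directly in the raw state observation space, avoiding learned embeddings or hand-designed goal spaces.
To approximate the likelihood maximization, the higher-level relabeling evaluates eight sampled candidate goals together with the original goal and the difference-based goal, and keeps the most likely one.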
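The intrinsic reward and the goal relabeling described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `policy(state, goal)` signature, the squared-error surrogate for the log-likelihood of a deterministic policy, and the `n_samples` and `sigma` defaults are all hypothetical choices made for the example.

```python
import math
import random
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical signature for a deterministic lower-level policy: (state, goal) -> action.
Policy = Callable[[Sequence[float], Sequence[float]], Sequence[float]]


def intrinsic_reward(s_t: Sequence[float], g_t: Sequence[float],
                     s_next: Sequence[float]) -> float:
    """Lower-level reward r = -||s_t + g_t - s_{t+1}||_2 from the review."""
    return -math.sqrt(sum((s + g - n) ** 2 for s, g, n in zip(s_t, g_t, s_next)))


def goal_transition(s_t: Sequence[float], g_t: Sequence[float],
                    s_next: Sequence[float]) -> Vector:
    """Re-express a relative goal after one step so the absolute target
    s_t + g_t stays fixed between higher-level decisions."""
    return [s + g - n for s, g, n in zip(s_t, g_t, s_next)]


def action_log_likelihood(policy: Policy, states: Sequence[Sequence[float]],
                          actions: Sequence[Sequence[float]],
                          goal: Sequence[float]) -> float:
    """Assumed surrogate log-likelihood of the stored actions under `goal`:
    -0.5 * sum_i ||a_i - policy(s_i, g_i)||^2, dropping constant terms.
    `states` holds c + 1 states and `actions` holds the c actions between them."""
    total = 0.0
    g = list(goal)
    for i, (s, a) in enumerate(zip(states[:-1], actions)):
        predicted = policy(s, g)
        total -= 0.5 * sum((ai - pi) ** 2 for ai, pi in zip(a, predicted))
        g = goal_transition(s, g, states[i + 1])  # advance the goal with the state
    return total


def relabel_goal(policy: Policy, states: Sequence[Sequence[float]],
                 actions: Sequence[Sequence[float]],
                 original_goal: Sequence[float], n_samples: int = 8,
                 sigma: float = 0.5,
                 rng: Optional[random.Random] = None) -> Vector:
    """Pick the candidate goal that best explains the stored actions
    under the current lower-level policy."""
    rng = rng or random.Random(0)
    # Difference-based goal: the state change actually observed over the segment.
    diff = [end - start for start, end in zip(states[0], states[-1])]
    candidates = [list(original_goal), diff]
    # Extra candidates: Gaussian perturbations of the observed difference (sigma is assumed).
    candidates += [[d + rng.gauss(0.0, sigma) for d in diff] for _ in range(n_samples)]
    return max(candidates,
               key=lambda g: action_log_likelihood(policy, states, actions, g))
```

As a usage example, take a toy policy that returns its goal as the action, `lambda s, g: list(g)`, and a stored segment with states `[[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]` and actions `[[2.0, 0.0], [1.0, 0.0]]`. Calling `relabel_goal` on it with a stale original goal such as `[5.0, 5.0]` selects the difference-based candidate, because that goal reproduces the stored actions exactly.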

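Experimental results

Research questions

RQ1: Can a two-level HRL system trained off-policy with an off-policy correction learn complex tasks efficiently?
RQ2: Does using raw state observations as goals for the lower-level policy improve learning speed and performance?
RQ3: How does the proposed off-policy correction affect stability and sample efficiency compared with naive off-policy HRL?
RQ4: How does HIRO's performance on challenging locomotion and object-interaction tasks compare with prior HRL methods?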

Main findings

Method	Ant Gather	Ant Maze	Ant Push	Ant Fall
HIRO	3.02 ± 1.49	0.99 ± 0.01	0.92 ± 0.04	0.66 ± 0.07
FuN representation	0.03 ± 0.01	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0
FuN transition PG	0.41 ± 0.06	0.0 ± 0.0	0.56 ± 0.39	0.01 ± 0.02
FuN cos similarity	0.85 ± 1.17	0.16 ± 0.33	0.06 ± 0.17	0.07 ± 0.22
FuN	0.01 ± 0.01	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0
SNN4HRL	1.92 ± 0.52	0.0 ± 0.0	0.02 ± 0.01	0.0 ± 0.0
VIME	1.42 ± 0.90	0.0 ± 0.0	0.02 ± 0.02	0.0 ± 0.0

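HIRO achieves the best performance on all four tasks: Ant Gather, Ant Maze, Ant Push, and Ant Fall.
At 10M steps, HIRO outperforms the FuN variants, SNN4HRL, and VIME baselines on every task, including Ant Gather; the closest competitor is SNN4HRL on Ant Gather (1.92 vs. 3.02), a method that pre-trains its lower level.
HIRO learns quickly, solving complex tasks after a few million environment steps (roughly a few days of real-world interaction).
The off-policy correction is important for stability and for performance on the harder tasks; naive off-policy training degrades on Ant Push and Ant Fall.
Using raw state observations as goals for the lower-level policy provides an immediate intrinsic reward signal and allows straightforward generalization across tasks.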

This review was generated by AI and checked by a human editor.