QUICK REVIEW

[Paper Review] Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play

Sainbayar Sukhbaatar, Zeming Lin|arXiv (Cornell University)|Mar 15, 2017

Reinforcement Learning in Robotics | References: 27 | Citations: 142

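One-Line Summary

This paper introduces asymmetric self-play with two policies, Alice and Bob: Alice generates tasks and Bob solves them, which enables unsupervised learning about the environment and accelerates learning of the target task.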

ABSTRACT

We describe a simple scheme that allows an agent to learn about its environment in an unsupervised manner. Our scheme pits two versions of the same agent, Alice and Bob, against one another. Alice proposes a task for Bob to complete; and then Bob attempts to complete the task. In this work we will focus on two kinds of environments: (nearly) reversible environments and environments that can be reset. Alice will "propose" the task by doing a sequence of actions and then Bob must undo or repeat them, respectively. Via an appropriate reward structure, Alice and Bob automatically generate a curriculum of exploration, enabling unsupervised training of the agent. When Bob is deployed on an RL task within the environment, this unsupervised training reduces the number of supervised episodes needed to learn, and in some cases converges to a higher reward.
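
The two task types in the abstract (undo in a reversible environment, repeat in a resettable one) can be illustrated with a toy example. This is only a sketch under stated assumptions: the integer-line environment, the random walk standing in for Alice's policy, the greedy walk standing in for Bob's policy, and all function names are hypothetical and not taken from the paper.

```python
import random

def alice_turn(start, max_steps, rng):
    """Alice takes random +/-1 steps on an integer line, then stops.

    Returns (final_state, steps_taken). The random walk is a stand-in
    for Alice's learned policy (an assumption of this sketch).
    """
    state = start
    steps = rng.randint(1, max_steps)
    for _ in range(steps):
        state += rng.choice((-1, 1))
    return state, steps

def bob_turn(start, goal, max_steps):
    """Bob walks greedily toward the goal; returns (reached, steps_taken)."""
    state, steps = start, 0
    while state != goal and steps < max_steps:
        state += 1 if goal > state else -1
        steps += 1
    return state == goal, steps

def self_play_episode(mode, start=0, max_steps=10, seed=0):
    """Run one toy self-play episode in 'reverse' or 'repeat' mode."""
    rng = random.Random(seed)
    alice_end, t_alice = alice_turn(start, max_steps, rng)
    if mode == "reverse":
        # Reversible environment: Bob begins where Alice stopped
        # and must return to Alice's starting state.
        bob_start, bob_goal = alice_end, start
    elif mode == "repeat":
        # Resettable environment: the state is reset and Bob must
        # reach the state in which Alice stopped.
        bob_start, bob_goal = start, alice_end
    else:
        raise ValueError(f"unknown mode: {mode}")
    reached, t_bob = bob_turn(bob_start, bob_goal, max_steps)
    return {"t_alice": t_alice, "t_bob": t_bob, "reached": reached}
```

In both modes the task Bob faces is defined entirely by what Alice did, so no external reward or hand-written goal is needed to generate it.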

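Motivation and Objectives

Encourage exploration in reinforcement learning agents by automatically learning environment dynamics without external rewards.
Develop a two-agent (Alice and Bob) self-play framework that creates a curriculum of tasks of gradually increasing difficulty.
Show that experience generated by self-play improves sample efficiency across a range of continuous and discrete tasks.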

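Proposed Method

A two-agent setup with internal rewards: Bob receives a negative reward proportional to the time he takes to complete the task, so the faster Bob succeeds, the harder the tasks Alice is pushed to propose, which yields a curriculum.
Applicable to reversible or resettable environments, where a task can be expressed as returning to an earlier state or reaching a target state.
Bob's policy is trained to perform the target task using knowledge gained from self-play episodes.
Alice's and Bob's policies can be tabular or neural networks; both take the state observation and the goal as input.
Training interleaves self-play episodes and target-task episodes, using policy gradient with a shared baseline.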
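
The internal reward structure and the mixing of episode types can be sketched as follows. This is a minimal illustration, assuming the reward form R_B = -γ·t_B for Bob and R_A = γ·max(0, t_B - t_A) for Alice, where t_A and t_B are the steps each agent took. The default value of `gamma`, the mixing probability `self_play_prob`, and the function names are assumptions made for this example.

```python
import random

def bob_reward(t_bob, gamma=0.01):
    """Bob is penalized in proportion to the steps he needed."""
    return -gamma * t_bob

def alice_reward(t_alice, t_bob, gamma=0.01):
    """Alice gains only when Bob needs more steps than she took."""
    return gamma * max(0, t_bob - t_alice)

def choose_episode_type(rng, self_play_prob=0.5):
    """Pick the next training episode type.

    The 50/50 default is an assumed value for illustration only.
    """
    return "self_play" if rng.random() < self_play_prob else "target_task"
```

Under this form, a task that Bob completes as fast as Alice set it earns Alice nothing, so she is pushed toward tasks that are quick for her to set but still slow for Bob. As Bob improves, those tasks have to become harder, which is where the curriculum comes from.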

Figure 1: Illustration of the self-play concept in a gridworld setting. Training consists of two types of episode: self-play and target task. In the former, Alice and Bob take turns moving the agent within the environment. Alice sets tasks by altering the state via interaction with its objects (key,

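Experimental Results

Research Questions

RQ1: Can tasks generated autonomously by Alice provide an unsupervised curriculum that improves Bob's learning of downstream target tasks?
RQ2: Does the self-play curriculum accelerate learning in discrete and continuous environments compared with standard exploration methods?
RQ3: How do reversible and resettable environments affect the design and effectiveness of asymmetric self-play?
RQ4: In a simple theoretical setting, to what extent can the self-play scheme learn a fast policy that achieves any state-goal pair (with Bob as a universal policy)?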

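Key Findings

Asymmetric self-play produces an automatic curriculum that accelerates target-task learning across multiple domains.
Self-play can match or exceed state-of-the-art exploration methods on several benchmarks, sometimes achieving faster early learning while maintaining final performance.
In reversible and resettable environments, it improves sample efficiency and in some cases speeds up convergence on the target task.
The method supports both tabular and neural architectures and, when combined with policy-gradient methods, extends to continuous control tasks.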

Figure 2: Left: The hallway task from section 4.1. The $y$ axis is fraction of successes on the target task, and the $x$ axis is the total number of training examples seen. Standard policy gradient (red) learns slowly. Adding an explicit exploration bonus (Strehl & Littman, 2008) (green) helps sig

This review was generated by AI and checked by a human editor.