QUICK REVIEW

[Paper Review] Learning with AMIGo: Adversarially Motivated Intrinsic Goals

Andres Campero, Roberta Răileanu|arXiv (Cornell University)|Jun 22, 2020

Adversarial Robustness in Machine Learning|References: 57|Citations: 48

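One-line summary

AMIGo trains a goal-generating teacher that proposes increasingly difficult intrinsic goals for a student policy, creating an autonomous curriculum that enables learning in sparse-reward environments and outperforming state-of-the-art intrinsic motivation baselines on challenging MiniGrid tasks.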

ABSTRACT

A key challenge for reinforcement learning (RL) consists of learning in environments with sparse extrinsic rewards. In contrast to current RL methods, humans are able to learn new skills with little or no reward by using various forms of intrinsic motivation. We propose AMIGo, a novel agent incorporating -- as form of meta-learning -- a goal-generating teacher that proposes Adversarially Motivated Intrinsic Goals to train a goal-conditioned "student" policy in the absence of (or alongside) environment reward. Specifically, through a simple but effective "constructively adversarial" objective, the teacher learns to propose increasingly challenging -- yet achievable -- goals that allow the student to learn general skills for acting in a new environment, independent of the task to be solved. We show that our method generates a natural curriculum of self-proposed goals which ultimately allows the agent to solve challenging procedurally-generated tasks where other forms of intrinsic motivation and state-of-the-art RL methods fail.

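Motivation and objectives

Motivate a learning framework that can solve sparse-reward RL tasks through intrinsic motivation.
Propose a goal-generating teacher that produces challenging yet achievable intrinsic goals for the student policy.
Demonstrate that automatic curriculum generation improves sample efficiency and task generalization in procedurally generated environments.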

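Proposed method

AMIGo is introduced, consisting of a goal-generating teacher G and a goal-conditioned student π.
The teacher G proposes a goal g at the start of an episode or whenever the student reaches the current goal. The student receives a goal-based intrinsic reward r^g together with the environment reward r^e.
The teacher reward r^T rewards goals that are hard but achievable for the student. A threshold t* controls goal difficulty.
The student optimizes the discounted return R_t using the reward r_t = r^g_t + r^e_t.
The environments are MiniGrid tasks with procedurally generated layouts. A goal is specified by coordinates (x, y) and is defined as a change in the tile observation at that location.
A set of auxiliary losses (goal diversity, awareness of episode boundaries, alignment with extrinsic goals) is examined, but these are not essential to AMIGo.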
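
The reward structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the binary form of the goal reward r^g, and the teacher payoff values `alpha` and `beta` are assumptions made for the example. Only the sum r_t = r^g_t + r^e_t, the threshold t*, and the discounted return R_t come from the summary.

```python
from typing import Optional, Sequence


def student_reward(goal_reached: bool, env_reward: float) -> float:
    """r_t = r^g_t + r^e_t, assuming a binary intrinsic goal reward r^g."""
    intrinsic = 1.0 if goal_reached else 0.0  # assumed form of r^g
    return intrinsic + env_reward


def teacher_reward(steps_to_goal: Optional[int], t_star: int,
                   alpha: float = 1.0, beta: float = 1.0) -> float:
    """Assumed teacher reward r^T: positive only when the goal was reached
    and took at least t_star steps (hard but achievable)."""
    if steps_to_goal is not None and steps_to_goal >= t_star:
        return alpha
    # Goal was either too easy (fewer than t_star steps) or never reached.
    return -beta


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """R_t = sum_k gamma^k * r_{t+k}, accumulated backwards over the episode."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

Under these assumptions, `teacher_reward(3, t_star=10)` is negative because the goal was too easy, and `teacher_reward(None, t_star=10)` is negative because the goal was never reached. The teacher is only paid for goals in between.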

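Experimental results

Research questions

RQ1 Can a goal-generating teacher learn an intrinsic-motivation curriculum that gradually raises task difficulty for the student policy?
RQ2 Does AMIGo outperform state-of-the-art intrinsic motivation baselines on difficult procedurally generated MiniGrid environments?
RQ3 How does ablating the auxiliary teacher losses affect learning and performance?
RQ4 Is the framework architecture-agnostic, that is, can it be adapted to different RL models and goal modalities?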

Key findings

Model	KCmedium	OMmedium	OMmedhard	KChard	KCharder	OMhard
AMIGo	.93 ± .00	.92 ± .00	.83 ± .05	.54 ± .45	.44 ± .44	.17 ± .34

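AMIGo achieves state-of-the-art results on several hard MiniGrid tasks and solves environments that other methods cannot.
On the hard environments, AMIGo outperforms IMPALA without intrinsic motivation as well as baselines such as Count, ICM, RND, RIDE, and ASP.
On the medium-difficulty environments, AMIGo performs on par with other strong intrinsic motivation methods.
AMIGo exhibits a natural curriculum in which the teacher raises goal difficulty as the student improves.
Qualitative analysis shows both cooperative and adversarial dynamics between teacher and student as training progresses.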
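
One way to picture the curriculum is as a difficulty threshold that rises once the student reliably meets it. The sketch below is an illustrative assumption rather than the schedule used in the paper: the class name `ThresholdSchedule`, the `patience` count, and the unit increment are all hypothetical.

```python
from typing import Optional


class ThresholdSchedule:
    """Hypothetical schedule: raise t* after `patience` consecutive episodes
    in which the student reached a goal that needed at least t* steps."""

    def __init__(self, t_star: int = 1, patience: int = 10) -> None:
        self.t_star = t_star
        self.patience = patience
        self._streak = 0

    def update(self, steps_to_goal: Optional[int]) -> int:
        """Record one episode; steps_to_goal is None if the goal was missed."""
        if steps_to_goal is not None and steps_to_goal >= self.t_star:
            self._streak += 1
        else:
            self._streak = 0  # a miss or a too-easy goal breaks the streak
        if self._streak >= self.patience:
            self.t_star += 1  # assumed unit increment
            self._streak = 0
        return self.t_star
```

With such a schedule, goals that were rewarded early in training stop paying off for the teacher once the threshold has moved past them, which pushes it toward goals that take the student longer to reach.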


This review was generated by AI and checked by a human editor.