QUICK REVIEW

[Paper Review] Go-Explore: a New Approach for Hard-Exploration Problems

Adrien Ecoffet, Joost Huizinga|arXiv (Cornell University)|Jan 30, 2019

Reinforcement Learning in Robotics | References: 99 | Citations: 225

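One-Line Summary

Go-Explore introduces a two-phase algorithm that remembers promising states, returns to them without exploring, and then explores from them; it finally robustifies the result with imitation learning to reach superhuman Atari performance on hard-exploration tasks.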

ABSTRACT

A grand challenge in reinforcement learning is intelligent exploration, especially when rewards are sparse or deceptive. Two Atari games serve as benchmarks for such hard-exploration domains: Montezuma's Revenge and Pitfall. On both games, current RL algorithms perform poorly, even those with intrinsic motivation, which is the dominant method to improve performance on hard-exploration domains. To address this shortfall, we introduce a new algorithm called Go-Explore. It exploits the following principles: (1) remember previously visited states, (2) first return to a promising state (without exploration), then explore from it, and (3) solve simulated environments through any available means (including by introducing determinism), then robustify via imitation learning. The combined effect of these principles is a dramatic performance improvement on hard-exploration problems. On Montezuma's Revenge, Go-Explore scores a mean of over 43k points, almost 4 times the previous state of the art. Go-Explore can also harness human-provided domain knowledge and, when augmented with it, scores a mean of over 650k points on Montezuma's Revenge. Its max performance of nearly 18 million surpasses the human world record, meeting even the strictest definition of "superhuman" performance. On Pitfall, Go-Explore with domain knowledge is the first algorithm to score above zero. Its mean score of almost 60k points exceeds expert human performance. Because Go-Explore produces high-performing demonstrations automatically and cheaply, it also outperforms imitation learning work where humans provide solution demonstrations. Go-Explore opens up many new research directions into improving it and weaving its insights into current RL algorithms. It may also enable progress on previously unsolvable hard-exploration problems in many domains, especially those that harness a simulator during training (e.g. robotics).

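Motivation and Objectives

Address the challenge of exploration when rewards are sparse or deceptive.
Improve performance on hard-exploration Atari benchmarks without relying heavily on intrinsic motivation.
Develop a two-phase framework: Phase 1 explores using an archive of promising states, and Phase 2 robustifies the result via imitation learning.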

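Proposed Method

Store promising states (cells), together with how to reach them, in an archive.
At each iteration, select a cell, return to it deterministically, and then explore from it with stochastic actions.
When a new cell is discovered or a better trajectory to a known cell appears, update the archive, which records each cell's trajectory, state, score, and trajectory length.
Use two cell representations: a simple downsampled 11x8 grayscale image that requires no domain knowledge, and a representation rich in domain knowledge (e.g., the agent's position, the room, keys).
Phase 2 robustifies the Phase 1 trajectories with imitation learning (the Backward Algorithm): training with PPO starts near the end of a trajectory and gradually moves toward its beginning, continuing until the policy matches or exceeds the original score.
The method distinguishes deterministic training (Phase 1) from stochastic evaluation: Phase 1 relies on deterministic resets, and Phase 2 adds stochasticity for robustness.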
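
The Phase 1 loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyEnv`, `Entry`, `cell_of`, `is_better`, and `go_explore_phase1` are hypothetical names, the environment is a toy deterministic grid rather than Atari, the cell is simply the agent's position (a domain-knowledge-style representation), cells are selected uniformly at random, and "returning" is done by restoring a saved simulator snapshot.

```python
import random
from dataclasses import dataclass


@dataclass
class Entry:
    state: tuple      # simulator snapshot used to return to the cell
    trajectory: list  # actions that reach the cell from the start state
    score: int        # score accumulated along that trajectory


class ToyEnv:
    """Hypothetical deterministic grid with one sparse reward in the far corner."""
    SIZE = 8
    MOVES = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}

    def __init__(self):
        self.pos = (0, 0)
        self.score = 0

    def snapshot(self):
        return (self.pos, self.score)

    def restore(self, snap):
        self.pos, self.score = snap

    def step(self, action):
        dx, dy = self.MOVES[action]
        x = min(max(self.pos[0] + dx, 0), self.SIZE - 1)
        y = min(max(self.pos[1] + dy, 0), self.SIZE - 1)
        self.pos = (x, y)
        if self.pos == (self.SIZE - 1, self.SIZE - 1) and self.score == 0:
            self.score = 1  # the reward is granted only once


def cell_of(env):
    # Assumption: the cell is the agent's grid position.
    return env.pos


def is_better(new, old):
    # Prefer a higher score; break ties with a shorter trajectory.
    if new.score != old.score:
        return new.score > old.score
    return len(new.trajectory) < len(old.trajectory)


def go_explore_phase1(iterations=3000, explore_steps=10, seed=0):
    rng = random.Random(seed)
    env = ToyEnv()
    archive = {cell_of(env): Entry(env.snapshot(), [], 0)}
    for _ in range(iterations):
        # Select a cell (uniformly at random; an assumption of this sketch).
        entry = archive[rng.choice(list(archive))]
        # Return to it deterministically, without exploration.
        env.restore(entry.state)
        trajectory = list(entry.trajectory)
        # Explore from it with random actions.
        for _ in range(explore_steps):
            action = rng.randrange(4)
            env.step(action)
            trajectory.append(action)
            candidate = Entry(env.snapshot(), list(trajectory), env.score)
            cell = cell_of(env)
            # Update the archive on a new cell or a better trajectory.
            if cell not in archive or is_better(candidate, archive[cell]):
                archive[cell] = candidate
    return archive


if __name__ == "__main__":
    archive = go_explore_phase1()
    best = max(archive.values(), key=lambda e: e.score)
    print(len(archive), "cells; best score", best.score,
          "with", len(best.trajectory), "actions")
```

Because the toy environment is deterministic, replaying any stored trajectory from the start state reproduces its cell and score. A second phase would then take such trajectories as demonstrations for robustification.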

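Experimental Results

Research Questions

RQ1 Can performance on hard-exploration tasks be improved by keeping an explicit archive of previously visited states and returning to them before exploring?
RQ2 Does combining deterministic exploration with subsequent robustification via imitation learning yield scalable improvements under sparse and deceptive rewards?
RQ3 To what extent does domain knowledge in the cell representation accelerate discovery and improve performance on hard-exploration benchmarks?
RQ4 How does Go-Explore address the detachment and derailment problems of earlier intrinsic-motivation approaches?
RQ5 How much does performance improve on Montezuma's Revenge and Pitfall, with and without domain knowledge?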

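Key Findings

On Montezuma's Revenge, without domain knowledge, Go-Explore scores a mean of over 43,000 points, almost 4 times the previous state of the art.
With easily provided domain knowledge, Go-Explore scores a mean of over 650,000 points and a maximum of nearly 18 million, surpassing the human world record.
On Pitfall, Go-Explore with domain knowledge scores a mean of 59,494 points and a maximum of 107,363, close to the maximum possible score in the game.
Without domain knowledge, the mean score on Montezuma's Revenge is 43,763, still far higher than prior work.
Go-Explore automatically generates high-performing demonstrations suitable for imitation learning, and outperforms prior imitation learning results that relied on human demonstrations.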

This review was generated by AI and checked by a human editor.