QUICK REVIEW

[Paper Review] #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning

Haoran Tang, Rein Houthooft|arXiv (Cornell University)|Nov 15, 2016

Reinforcement Learning in Robotics | References: 35 | Citations: 343

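One-sentence summary

This paper extends classic count-based exploration to high-dimensional state spaces through hashing, and reports near state-of-the-art results on continuous control and Atari benchmarks with both static and learned hash codes.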

ABSTRACT

Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur once. Recent deep RL exploration strategies are able to deal with high-dimensional continuous state spaces through complex heuristics, often relying on optimism in the face of uncertainty or intrinsic motivation. In this work, we describe a surprising finding: a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks. States are mapped to hash codes, which allows to count their occurrences with a hash table. These counts are then used to compute a reward bonus according to the classic count-based exploration theory. We find that simple hash functions can achieve surprisingly good results on many challenging tasks. Furthermore, we show that a domain-dependent learned hash code may further improve these results. Detailed analysis reveals important aspects of a good hash function: 1) having appropriate granularity and 2) encoding information relevant to solving the MDP. This exploration strategy achieves near state-of-the-art performance on both continuous control tasks and Atari 2600 games, hence providing a simple yet powerful baseline for solving MDPs that require considerable exploration.

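Motivation and objectives

Motivate the need for robust exploration in high-dimensional state spaces in deep RL.
Generalize count-based exploration to continuous and complex state spaces by hashing states.
Evaluate simple hash-based exploration on challenging deep RL benchmarks.
Analyze what makes hash-based exploration effective (granularity, task-relevant information).
Provide a fast, flexible baseline that is compatible with common deep RL algorithms.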

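Proposed method

Discretize the state space with a hash function φ, maintain a count n(φ(s)) for each hash code, and add a bonus of β / sqrt(n(φ(s))) to the reward.
Use SimHash as a practical, scalable hash for continuous states (Algorithm 1).
To explore learned hashing, train an autoencoder to produce binary codes, then apply SimHash to those codes to obtain φ(s) (Algorithm 2).
Optionally apply domain-knowledge preprocessing (static features such as BASS) or learned representations to improve hash quality.
Evaluate with TRPO on the rllab continuous control benchmarks and on Atari 2600, comparing against baselines and earlier exploration methods.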
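
The static-hash variant described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name SimHashCounter, its parameters, and the toy state are hypothetical, the projection matrix is assumed to have standard Gaussian entries, and hash bits are stored as 0/1 rather than as signs.

```python
import math
import random
from collections import defaultdict


class SimHashCounter:
    """Count-based exploration bonus via SimHash (illustrative sketch)."""

    def __init__(self, state_dim, k=16, beta=0.01, seed=0):
        rng = random.Random(seed)
        # Random projection matrix A (k x state_dim), entries assumed ~ N(0, 1).
        self.A = [[rng.gauss(0.0, 1.0) for _ in range(state_dim)]
                  for _ in range(k)]
        self.beta = beta
        self.counts = defaultdict(int)  # hash table holding n(phi(s))

    def phi(self, state):
        """Hash code phi(s): the sign pattern of A s, as a tuple of 0/1 bits."""
        return tuple(
            1 if sum(a * x for a, x in zip(row, state)) >= 0.0 else 0
            for row in self.A
        )

    def bonus(self, state):
        """Increment n(phi(s)) and return beta / sqrt(n(phi(s)))."""
        code = self.phi(state)
        self.counts[code] += 1
        return self.beta / math.sqrt(self.counts[code])


counter = SimHashCounter(state_dim=3, k=8, beta=1.0)
s = [0.5, -1.2, 0.3]
first = counter.bonus(s)   # first visit: n = 1
second = counter.bonus(s)  # second visit: n = 2, so the bonus shrinks
```

In a training loop, the value returned by bonus would be added to the environment reward before the policy update, so states whose hash codes have been seen rarely receive a larger reward.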

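Experimental results

Research questions

RQ1: Does adding count-based exploration through hashing improve performance on continuous control and Atari benchmarks?
RQ2: How do static and learned hash codes affect exploration performance on image-based observations?
RQ3: Which properties of a hash function (granularity, informative encoding) contribute most to effective exploration?
RQ4: How does hash-based exploration compare with state-of-the-art deep RL exploration methods?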
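
To make the granularity question in RQ3 concrete, the sketch below counts how many distinct hash codes a fixed set of states falls into as the code length k grows. It is an assumption-laden toy: make_simhash and distinct_codes are hypothetical helpers, the states are random Gaussian vectors rather than observations from any benchmark, and nothing here reproduces the paper's experiments.

```python
import random


def make_simhash(state_dim, k, seed=0):
    """Return a SimHash function with k bits (illustrative helper)."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(state_dim)] for _ in range(k)]

    def phi(state):
        return tuple(
            1 if sum(a * x for a, x in zip(row, state)) >= 0.0 else 0
            for row in A
        )

    return phi


def distinct_codes(k, states):
    """Number of distinct k-bit hash codes occupied by the given states."""
    phi = make_simhash(len(states[0]), k)
    return len({phi(s) for s in states})


rng = random.Random(1)
states = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]

coarse = distinct_codes(2, states)   # at most 2**2 = 4 buckets
fine = distinct_codes(16, states)    # up to one bucket per state
```

With small k many different states share a bucket, so counts grow quickly and the bonus decays fast; with large k almost every state gets its own bucket, and the counts carry little information. Picking k between these extremes is the granularity trade-off the review refers to.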

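Key findings

Hash-based exploration achieves near state-of-the-art performance on several benchmarks.
Static SimHash and learned hashing (AE-SimHash) can outperform baseline TRPO on multiple Atari games and continuous control tasks.
Domain-dependent preprocessing (BASS) and learned hash codes bring large improvements, particularly on games such as Montezuma's Revenge and Venture.
Even simple hashing provides a strong exploration signal without a complex intrinsic-motivation scheme.
The approach is fast, flexible, and complementary to existing deep RL algorithms.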


This review was generated by AI and checked by a human editor.