QUICK REVIEW

[Paper Review] Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents

Edoardo Conti, Vashisht Madhavan|arXiv (Cornell University)|Dec 18, 2017

Reinforcement Learning in Robotics|References: 51|Citations: 115

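One-Line Summary

This paper combines novelty search and quality-diversity exploration with evolution strategies (ES) to improve exploration in deep reinforcement learning; the resulting algorithms, NS-ES, NSR-ES, and NSRA-ES, outperform ES on tasks with deceptive or sparse rewards while retaining the scalability of ES.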

ABSTRACT

Evolution strategies (ES) are a family of black-box optimization algorithms able to train deep neural networks roughly as well as Q-learning and policy gradient methods on challenging deep reinforcement learning (RL) problems, but are much faster (e.g. hours vs. days) because they parallelize better. However, many RL problems require directed exploration because they have reward functions that are sparse or deceptive (i.e. contain local optima), and it is unknown how to encourage such exploration with ES. Here we show that algorithms that have been invented to promote directed exploration in small-scale evolved neural networks via populations of exploring agents, specifically novelty search (NS) and quality diversity (QD) algorithms, can be hybridized with ES to improve its performance on sparse or deceptive deep RL tasks, while retaining scalability. Our experiments confirm that the resultant new algorithms, NS-ES and two QD algorithms, NSR-ES and NSRA-ES, avoid local optima encountered by ES to achieve higher performance on Atari and simulated robots learning to walk around a deceptive trap. This paper thus introduces a family of fast, scalable algorithms for reinforcement learning that are capable of directed exploration. It also adds this new family of exploration algorithms to the RL toolbox and raises the interesting possibility that analogous algorithms with multiple simultaneous paths of exploration might also combine well with existing RL algorithms outside ES.

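Motivation and Objectives

Motivate the need for directed exploration in deep reinforcement learning with sparse or deceptive rewards.
Introduce a way to integrate novelty search (NS) and quality diversity (QD) into evolution strategies (ES).
Develop NS-ES, NSR-ES, and NSRA-ES to achieve scalable, population-based exploration.
Evaluate the proposed methods on high-dimensional tasks, including Atari and simulated robotics, and show that they improve on ES.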

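Proposed Method

Formulate ES as gradient ascent on a population distribution over network parameters.
Introduce NS-ES, which maximizes expected novelty with respect to an archive of behaviors.
Extend it to NSR-ES by combining the novelty and reward signals through a rank-normalized average.
Develop NSRA-ES, which uses an adaptive weight w between novelty and reward to balance exploration and exploitation.
Maintain a meta-population of M agents and choose which agent to update by novelty-based probabilistic selection.
Provide algorithmic details and a parallelizable implementation suited to large DNNs.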
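
The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The review does not say how novelty is measured, so the k-nearest-neighbor mean distance used here is an assumption, as are the Euclidean metric, the centered-rank scaling, every function name, and the toy reward. How NSRA-ES adapts w is not described in this review, so w is simply passed in as a parameter.

```python
import math
import random

def novelty(behavior, archive, k=3):
    """Assumed novelty measure: mean Euclidean distance to the k nearest archive entries."""
    dists = sorted(math.dist(behavior, other) for other in archive)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)

def centered_ranks(values):
    """Rank-normalize values into [-0.5, 0.5]; needs at least two values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = rank / (len(values) - 1) - 0.5
    return ranks

def blended_scores(rewards, novelties, w):
    """w=1 uses reward only (ES), w=0 novelty only (NS-ES), w=0.5 a plain average (NSR-ES)."""
    r = centered_ranks(rewards)
    n = centered_ranks(novelties)
    return [w * ri + (1.0 - w) * ni for ri, ni in zip(r, n)]

def es_step(theta, noises, scores, sigma, lr):
    """Gradient-ascent step: theta + lr / (n * sigma) * sum_i score_i * eps_i."""
    n = len(noises)
    grad = [sum(s * eps[j] for s, eps in zip(scores, noises)) / (n * sigma)
            for j in range(len(theta))]
    return [t + lr * g for t, g in zip(theta, grad)]

def pick_agent(agent_novelties, rng):
    """Pick a meta-population index with probability proportional to its novelty.
    At least one novelty value must be positive."""
    return rng.choices(range(len(agent_novelties)), weights=agent_novelties, k=1)[0]

# Toy usage: the "behavior" of a candidate is just its 2-D parameter vector.
rng = random.Random(0)
archive = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
theta, sigma, lr = [0.5, 0.5], 0.1, 0.05
noises = [[rng.gauss(0, 1) for _ in theta] for _ in range(20)]
candidates = [[t + sigma * e for t, e in zip(theta, eps)] for eps in noises]
rewards = [-(c[0] ** 2 + c[1] ** 2) for c in candidates]    # hypothetical reward
novelties = [novelty(c, archive, k=2) for c in candidates]
theta = es_step(theta, noises, blended_scores(rewards, novelties, w=0.5), sigma, lr)
```

Keeping w as an explicit argument makes the relationship between the variants visible: the same update serves ES, NS-ES, and NSR-ES, and an adaptive scheme only has to change w between iterations.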

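Experimental Results

Research Questions

RQ1 Can novelty-seeking strategies (NS and QD) improve the performance of ES on sparse or deceptive RL tasks without sacrificing scalability?
RQ2 Do NS-ES, NSR-ES, and NSRA-ES avoid the local optima that standard ES falls into in high-dimensional domains?
RQ3 Does adaptive weighting of novelty and reward (NSRA-ES) provide robust performance across diverse environments?

Key Findings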

Game	ES	NS-ES	NSR-ES	NSRA-ES	DQN	NoisyDQN	A3C+
Alien	3283.8	1124.5	2186.2	4846.4	2404	2403	1848.3
Amidar	322.2	134.7	255.8	305.0	924	1610	964.7
Bank Heist	140.0	50.0	130.0	152.9	455	1068	991.9
Beam Rider	871.7	805.5	876.9	906.4	10564	20793	5992.1
Freeway	31.1	22.8	32.3	32.9	31	32	27.3
Frostbite	367.4	250.0	2978.6	3785.4	1000	753	506.6
Gravitar	1129.4	527.5	732.9	1140.9	366	447	246.02
Montezuma	0.0	0.0	0.0	0.0	2	3	142.5
Ms. Pacman	4498.0	2252.2	3495.2	5171.0	2674	2722	2380.6
Seaquest	960.0	1044.5	2329.7	960.0	4163	2282	2274.1
Zaxxon	9885.0	1761.9	6723.3	7303.3	4806	6920	7956.1

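NS-ES and the two QD-ES variants (NSR-ES and NSRA-ES) avoid the local optima that trap ES and achieve higher performance on Atari and on a simulated locomotion task.
NS-ES can solve humanoid locomotion from novelty alone, even though it ignores the reward signal.
NSR-ES incorporates reward while preserving novelty, which speeds up learning; NSRA-ES adaptively weights novelty and reward and often achieves the best overall performance.
In the Atari experiments, NS-ES and especially NSRA-ES outperform ES on several games, and in median reward over multiple runs they compete with or surpass DQN and A3C+, which rely on conventional exploration methods.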
NSRA-ES shows robustness by adapting its exploration pressure: it achieves a higher median reward than ES on 8 of 12 games and than NSR-ES on 9 of 12 games.
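
Head-to-head counts like these can be cross-checked against the table with a short script. The sketch below copies the ES, NS-ES, NSR-ES, and NSRA-ES columns from the table above and counts strict wins; the helper name `count_wins` is illustrative. Only the 11 games listed in this review are included, so the counts are out of 11, and ties (for example Montezuma) count for neither side.

```python
# Scores copied from the table above: (ES, NS-ES, NSR-ES, NSRA-ES) per game.
scores = {
    "Alien":      (3283.8, 1124.5, 2186.2, 4846.4),
    "Amidar":     (322.2, 134.7, 255.8, 305.0),
    "Bank Heist": (140.0, 50.0, 130.0, 152.9),
    "Beam Rider": (871.7, 805.5, 876.9, 906.4),
    "Freeway":    (31.1, 22.8, 32.3, 32.9),
    "Frostbite":  (367.4, 250.0, 2978.6, 3785.4),
    "Gravitar":   (1129.4, 527.5, 732.9, 1140.9),
    "Montezuma":  (0.0, 0.0, 0.0, 0.0),
    "Ms. Pacman": (4498.0, 2252.2, 3495.2, 5171.0),
    "Seaquest":   (960.0, 1044.5, 2329.7, 960.0),
    "Zaxxon":     (9885.0, 1761.9, 6723.3, 7303.3),
}
COLUMNS = {"ES": 0, "NS-ES": 1, "NSR-ES": 2, "NSRA-ES": 3}

def count_wins(a, b):
    """Number of listed games where method `a` scores strictly higher than method `b`."""
    ia, ib = COLUMNS[a], COLUMNS[b]
    return sum(1 for row in scores.values() if row[ia] > row[ib])

print(count_wins("NSRA-ES", "ES"), "of", len(scores))      # 7 of 11
print(count_wins("NSRA-ES", "NSR-ES"), "of", len(scores))  # 9 of 11
print(count_wins("NSR-ES", "ES"), "of", len(scores))       # 4 of 11
```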

This review was generated by AI and checked by a human editor.