QUICK REVIEW

[Paper Review] Unifying Count-Based Exploration and Intrinsic Motivation

Marc G. Bellemare, Sriram Srinivasan|arXiv (Cornell University)|Jun 6, 2016

Reinforcement Learning in Robotics | References: 50 | Citations: 246

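One-Sentence Summary

This paper introduces pseudo-counts derived from density models, generalizing count-based exploration to non-tabular settings and connecting it to information gain; the resulting bonus improves exploration in hard Atari 2600 games, including Montezuma's Revenge.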

ABSTRACT

We consider an agent's uncertainty about its environment and the problem of generalizing this uncertainty across observations. Specifically, we focus on the problem of exploration in non-tabular reinforcement learning. Drawing inspiration from the intrinsic motivation literature, we use density models to measure uncertainty, and propose a novel algorithm for deriving a pseudo-count from an arbitrary density model. This technique enables us to generalize count-based exploration algorithms to the non-tabular case. We apply our ideas to Atari 2600 games, providing sensible pseudo-counts from raw pixels. We transform these pseudo-counts into intrinsic rewards and obtain significantly improved exploration in a number of hard games, including the infamously difficult Montezuma's Revenge.

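Motivation and Objectives

Motivates the exploration problem in non-tabular reinforcement learning and describes the limitations of conventional count-based methods.
Proposes a density-model-based mechanism for deriving pseudo-counts that generalize counts across states.
Establishes theoretical connections among pseudo-counts, prediction gain, and information gain.
Demonstrates the practical effectiveness of pseudo-count bonuses on Atari 2600 games, including Montezuma's Revenge, in both actor-critic and replay-based settings.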

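Proposed Method

Defines the pseudo-count by relating the density model's current probability rho_n(x) to its recoding probability rho'_n(x), the probability the model assigns to x after observing it one more time.
Uses the recoding probability to derive the pseudo-count N_hat_n(x), which generalizes the empirical count N_n(x) to non-tabular spaces.
Relates pseudo-counts to information gain and prediction gain, proving IG_n(x) ≤ PG_n(x) ≤ N_hat_n(x)^{-1} and PG_n(x) ≤ N_hat_n(x)^{-1/2}.
Applies the pseudo-count exploration bonus R^+_n(x,a) = β (N_hat_n(x) + 0.01)^{-1/2} to MBIE-EB-style planning and to the DQN and A3C frameworks.
Validates the properties of pseudo-counts on a simple Atari example (Freeway), then extends the experiments to Atari 2600 games by applying a CTS density model to pixels.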
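The steps above can be illustrated with a minimal Python sketch. It assumes the pseudo-count formula N_hat_n(x) = rho_n(x) (1 - rho'_n(x)) / (rho'_n(x) - rho_n(x)), which this review does not spell out, and it swaps in a toy empirical-frequency model in place of the CTS pixel model. The names EmpiricalDensity, pseudo_count, exploration_bonus, and observe are hypothetical, and beta = 0.05 is an illustrative value, not one taken from this review.

```python
import math
from collections import Counter


class EmpiricalDensity:
    """Toy stand-in for a density model: rho_n(x) = N_n(x) / n.

    This is NOT the CTS model; it is an assumption made for illustration.
    """

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def prob(self, x):
        # Probability of x given everything observed so far (0 if nothing seen).
        return self.counts[x] / self.total if self.total else 0.0

    def update(self, x):
        self.counts[x] += 1
        self.total += 1


def pseudo_count(rho, rho_prime):
    """Assumed formula: N_hat = rho * (1 - rho') / (rho' - rho)."""
    if rho_prime <= rho:
        # The model did not raise its probability for x, so the ratio is
        # undefined; this sketch treats x as fully familiar.
        return math.inf
    return rho * (1.0 - rho_prime) / (rho_prime - rho)


def exploration_bonus(n_hat, beta=0.05):
    """Bonus R+ = beta * (N_hat + 0.01)^(-1/2); beta is illustrative."""
    return beta / math.sqrt(n_hat + 0.01)


def observe(model, x, beta=0.05):
    """Query rho, update the model on x, query rho', return (N_hat, bonus)."""
    rho = model.prob(x)
    model.update(x)
    rho_prime = model.prob(x)  # recoding probability
    n_hat = pseudo_count(rho, rho_prime)
    return n_hat, exploration_bonus(n_hat, beta)


if __name__ == "__main__":
    model = EmpiricalDensity()
    for state in ["a", "b", "a", "a", "c"]:
        n_hat, bonus = observe(model, state)
        print(f"{state}: pseudo-count={n_hat:.3f}, bonus={bonus:.3f}")
```

With this toy model the pseudo-count should match the true number of earlier visits (up to floating-point error), which is the sanity property a pseudo-count is meant to satisfy; a learned model such as CTS would instead share counts across similar observations.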

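Experimental Results

Research Questions

RQ1: Can pseudo-counts derived from a density model generalize visit counts to non-tabular state spaces?
RQ2: How do pseudo-counts relate to information gain and prediction gain, and can they provide theoretical guarantees for exploration?
RQ3: Do pseudo-count bonuses improve exploration for value-based and policy-based RL methods on hard Atari games, including Montezuma's Revenge?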

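Key Findings

Pseudo-counts provide a meaningful, generalizable notion of state novelty in non-tabular settings.
Prediction gain approximates information gain and is itself bounded above by the inverse pseudo-count, which ties it to the exploration bonus.
Pseudo-count bonuses substantially improve exploration over the baselines, especially on hard Atari games such as Montezuma's Revenge.
A3C augmented with the pseudo-count bonus (A3C+) achieves higher median performance than plain A3C across 60 Atari games.
CTS-based pseudo-counts speed up exploration and raise scores on Montezuma's Revenge within a given frame budget.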
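The link between prediction gain and pseudo-counts noted in the findings can be made explicit with a short derivation. This is a sketch that assumes the pseudo-count N_hat_n(x) and pseudo-count total n_hat are defined by requiring the density model to behave like an empirical count, and that PG_n(x) is the log-ratio of the recoding probability to the current probability with PG_n(x) > 0; neither definition is stated in this review.

```latex
\begin{align}
\rho_n(x) &= \frac{\hat N_n(x)}{\hat n},
\qquad
\rho'_n(x) = \frac{\hat N_n(x) + 1}{\hat n + 1}
\\
\Rightarrow\quad
\hat N_n(x) &= \frac{\rho_n(x)\,\bigl(1 - \rho'_n(x)\bigr)}{\rho'_n(x) - \rho_n(x)}
\\
PG_n(x) &= \log \rho'_n(x) - \log \rho_n(x)
\quad\Rightarrow\quad
\frac{\rho'_n(x)}{\rho_n(x)} = e^{PG_n(x)}
\\
\hat N_n(x) &= \frac{1 - \rho'_n(x)}{e^{PG_n(x)} - 1}
\;\le\; \bigl(e^{PG_n(x)} - 1\bigr)^{-1}
\;\le\; PG_n(x)^{-1}
\end{align}
```

The last line uses 1 - rho'_n(x) ≤ 1 and e^y - 1 ≥ y; rearranging gives PG_n(x) ≤ N_hat_n(x)^{-1}, which agrees with the bound stated under the proposed method.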


This review was generated by AI and checked by a human editor.