QUICK REVIEW

[Paper Review] The Uncertainty Bellman Equation and Exploration

Brendan O’Donoghue, Ian Osband|arXiv (Cornell University)|Sep 15, 2017

Simulation Techniques and Applications|References: 38|Citations: 58

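One-line summary

Introduces the uncertainty Bellman equation (UBE), which propagates Q-value uncertainty backward across time-steps and thereby enables deep exploration. Replacing ε-greedy with Thompson sampling over the learned uncertainty empirically improves DQN performance on Atari.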
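The action-selection step mentioned in the summary can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `thompson_style_action`, the scaling parameter `beta`, and the clipping of negative uncertainties are assumptions made here. The idea is to perturb each Q-value by Gaussian noise scaled by the square root of its uncertainty estimate and then act greedily.

```python
import numpy as np

def thompson_style_action(q_values, uncertainties, beta=1.0, rng=None):
    """Pick an action by sampling around Q with per-action uncertainty.

    q_values, uncertainties: 1-D sequences with one entry per action.
    beta: hypothetical scale on the noise (an assumption of this sketch).
    """
    rng = np.random.default_rng() if rng is None else rng
    q = np.asarray(q_values, dtype=float)
    # Uncertainties play the role of variances, so clip at zero before the sqrt.
    u = np.clip(np.asarray(uncertainties, dtype=float), 0.0, None)
    # One independent standard-normal draw per action.
    zeta = rng.standard_normal(q.shape)
    return int(np.argmax(q + beta * zeta * np.sqrt(u)))
```

With all uncertainties at zero the rule reduces to greedy action selection; as an action's uncertainty grows, it is chosen more often even when its Q-value estimate is lower.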

ABSTRACT

We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar \textit{uncertainty} Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for $ε$-greedy improves DQN performance on 51 out of 57 games in the Atari suite.
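For orientation, the recursion described in the abstract can be written roughly as below. The notation is an assumption chosen for this review (finite horizon $H$, step index $h$, policy $\pi$, posterior-mean transition estimate $\mathbb{E}_t[\hat{P}]$, local uncertainty $\nu$) and may differ in detail from the paper's own statement.

```latex
u^{h}_{sa} \;=\; \nu^{h}_{sa} \;+\; \sum_{s',a'} \pi^{h}_{s'a'}\,
  \mathbb{E}_t\!\big[\hat{P}^{h}_{s'sa}\big]\, u^{h+1}_{s'a'},
\qquad u^{H+1} = 0,
\qquad \operatorname{var}_t \hat{Q}^{h}_{sa} \;\le\; u^{h}_{sa}.
```

The first relation has the same shape as the ordinary Bellman equation, with the local uncertainty $\nu$ in the place of the reward. The last inequality is the upper-bound property claimed for the fixed point.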

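Motivation and objectives

Motivate and formalize exploration driven by the propagation of uncertainty in reinforcement learning.
Define the uncertainty Bellman equation (UBE) and establish its fixed-point properties.
Propose practical methods for estimating local uncertainty and integrating the UBE into deep RL.
Demonstrate the empirical gains of UBE-driven exploration over the standard epsilon-greedy strategy on Atari.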

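Proposed method

Derives a Bellman-like equation (the UBE) for the posterior variance of the Q-values and proves that its unique fixed point upper-bounds that variance.
Defines the local uncertainty ν in terms of $\mathrm{var}_t(\hat{\mu})$ and $\mathrm{var}_t(\hat{P})$, which yields a computable upper bound on $\mathrm{var}_t(\hat{Q})$.
Solves the UBE to obtain an uncertainty u that upper-bounds $\mathrm{var}_t(\hat{Q})$, and uses it in a Thompson-sampling-like action selection rule (Eq. 3).
Describes practical ways to estimate the local uncertainty in tabular, linear, and neural-network settings, including the Sherman-Morrison-Woodbury update for Bayesian linear estimation.
Proposes an extension to deep RL with a two-headed network (Q and uncertainty) and a one-step UBE exploration algorithm (Algorithm 1).
Compares UBE-based exploration with count-based bonuses and epsilon-greedy in Atari experiments.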
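In the tabular, finite-horizon case, solving the UBE amounts to a backward pass that mirrors policy evaluation. The sketch below is illustrative only: the function name `solve_ube`, the array layouts, and the use of a stationary policy are assumptions made here, and the local uncertainties `nu` are taken as given.

```python
import numpy as np

def solve_ube(nu, p_mean, policy):
    """Backward recursion for a tabular, finite-horizon UBE (illustrative).

    nu:     (H, S, A) local uncertainties, assumed already estimated.
    p_mean: (S, A, S) posterior-mean transition probabilities.
    policy: (S, A) action probabilities (stationary, an assumption here).
    Returns u with shape (H, S, A).
    """
    H, S, A = nu.shape
    u = np.zeros((H + 1, S, A))  # u[H] = 0 is the terminal condition
    for h in range(H - 1, -1, -1):
        # Expected next-step uncertainty per next state under the policy: (S,)
        next_u = (policy * u[h + 1]).sum(axis=1)
        # Local uncertainty plus expected downstream uncertainty: (S, A)
        u[h] = nu[h] + p_mean @ next_u
    return u[:H]
```

For a single state and action with `nu` equal to 1 at both of two steps, the recursion gives `u = [2, 1]`: uncertainty accumulates backward from the end of the episode in the same way that rewards accumulate into values.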

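Experimental results

Research questions

RQ1: Can Q-value uncertainty be propagated across time-steps by a Bellman-style recursion (the UBE)?
RQ2: Does solving the UBE give a meaningful upper bound on the posterior variance of the Q-values and improve exploration efficiency?
RQ3: Does a practical deep RL algorithm that uses UBE-inspired uncertainty outperform standard exploration strategies in hard environments?
RQ4: How should local uncertainty be estimated in tabular, linear, and neural-network settings for UBE-based exploration?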
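For the linear setting raised in RQ4, one common way to keep a Bayesian linear estimate up to date is a rank-one Sherman-Morrison update of the inverse precision matrix. The sketch below illustrates that general technique; the function names, the identity prior, and the form `beta**2 * phi @ sigma @ phi` for the local uncertainty are assumptions of this sketch rather than details confirmed by the review.

```python
import numpy as np

def sherman_morrison_update(sigma, phi):
    """Return inv(inv(sigma) + outer(phi, phi)) without a fresh inversion.

    sigma: (d, d) symmetric inverse precision matrix.
    phi:   (d,) feature vector of the newly observed state.
    """
    sigma_phi = sigma @ phi
    # Symmetry of sigma is what lets the numerator be a single outer product.
    return sigma - np.outer(sigma_phi, sigma_phi) / (1.0 + phi @ sigma_phi)

def local_uncertainty(sigma, phi, beta=1.0):
    """Hypothetical local uncertainty for features phi: beta^2 * phi^T sigma phi."""
    return beta ** 2 * float(phi @ sigma @ phi)
```

Each update costs O(d^2) instead of the O(d^3) of a full inversion, and the uncertainty assigned to a feature vector shrinks every time that vector is observed.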

Key findings

Algorithm	mean	median	> human
DQN	688.60	79.41	21
DQN Intrinsic Motivation	472.93	76.73	24
DQN UBE 1-step	776.40	94.54	26
DQN UBE n-step	439.88	126.41	35

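The UBE has a unique fixed point that gives a pointwise upper bound on the posterior variance of the Q-values under any policy.
Compared with count-based bonuses, UBE-based exploration propagates uncertainty across time-steps and scales naturally to large systems with generalization.
In the Atari experiments, replacing epsilon-greedy with Thompson sampling over the learned uncertainty head improves performance, and the n-step UBE variant achieves the best score on 32 of the 57 games.
A two-headed network learns the Q-values and the uncertainty simultaneously with little computational overhead (roughly a 10% drop in frame rate).
The method improves markedly on vanilla DQN, is competitive with intrinsic motivation methods, and reaches superhuman performance on several games.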
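The two-headed setup above can be pictured as two regression targets computed from the same transition. The sketch below is a guess at the general shape, not Algorithm 1 itself: the function name `one_step_targets` and its arguments are hypothetical, and the `gamma ** 2` factor on the uncertainty target is an assumption that follows from treating u as a variance (scaling a quantity by gamma scales its variance by gamma squared).

```python
import numpy as np

def one_step_targets(reward, q_next, u_next, nu, a_next, gamma=0.99, terminal=False):
    """One-step regression targets for the Q head and the uncertainty head.

    q_next, u_next: per-action outputs of the two heads at the next state.
    nu:             local uncertainty estimate for the current state-action.
    a_next:         action taken (or selected) at the next state.
    """
    if terminal:
        # No downstream value or uncertainty after a terminal transition.
        return float(reward), float(nu)
    y_q = reward + gamma * q_next[a_next]
    # Assumption: variance-like quantities discount with gamma squared.
    y_u = nu + gamma ** 2 * u_next[a_next]
    return float(y_q), float(y_u)
```

Because both targets come from the same sampled transition and a shared network body, the extra head adds little work per step, which is consistent with the small frame-rate cost reported above.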

This review was generated by AI and checked by a human editor.