QUICK REVIEW

[Paper Review] Distributional Reinforcement Learning with Quantile Regression

Will Dabney, Mark Rowland|arXiv (Cornell University)|Oct 27, 2017

Sports Analytics and Performance|Citations: 150

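One-line summary

This paper shows how to learn the value distribution in reinforcement learning end to end with quantile regression under the Wasserstein metric, introduces QR-DQN, and reports state-of-the-art results on Atari.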

ABSTRACT

In reinforcement learning an agent interacts with the environment by taking actions and observing the next state and reward. When sampled probabilistically, these state transitions, rewards, and actions can all induce randomness in the observed long-term return. Traditionally, reinforcement learning algorithms average over this randomness to estimate the value function. In this paper, we build on recent work advocating a distributional approach to reinforcement learning in which the distribution over returns is modeled explicitly instead of only estimating the mean. That is, we examine methods of learning the value distribution instead of the value function. We give results that close a number of gaps between the theoretical and algorithmic results given by Bellemare, Dabney, and Munos (2017). First, we extend existing results to the approximate distribution setting. Second, we present a novel distributional reinforcement learning algorithm consistent with our theoretical formulation. Finally, we evaluate this new algorithm on the Atari 2600 games, observing that it significantly outperforms many of the recent improvements on DQN, including the related distributional algorithm C51.

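Research motivation and objectives

Motivate modeling the full distribution of returns in reinforcement learning, not only its mean.
Close the gap between theory and practice by enabling end-to-end optimization under the Wasserstein metric.
Develop a practical algorithm (QR-DQN) that learns the value distribution with quantile regression.
Demonstrate better performance on the Atari 2600 benchmark than prior distributional methods.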

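Proposed method

Replaces the C51 parameterization (fixed support locations with learned probabilities) with fixed uniform probabilities and learnable locations, so the network effectively estimates quantiles of the return distribution.
Uses quantile regression to minimize the 1-Wasserstein distance between the target and predicted distributions, which permits unbiased stochastic gradient updates.
Proves that the combination of the quantile projection and the distributional Bellman operator is a contraction under a Wasserstein metric, which gives convergence.
Derives and implements quantile regression TD (QRTD) for policy evaluation, introduces QR-DQN for control, and offers the quantile-Huber loss as an option.
Adapts the DQN architecture to output N quantiles per action and trains it with a quantile regression loss instead of the standard TD loss.
Validates the approach empirically on a gridworld-style task and on 57 Atari 2600 games, comparing against C51 and DQN variants.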
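
The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`quantile_midpoints`, `quantile_huber_loss`, `qrtd_update`, `q_value`) and the argument layout are hypothetical, the loss is written for a single state-action pair, and `kappa = 0` is assumed to mean the strict quantile loss.

```python
def quantile_midpoints(n):
    """Midpoints tau_hat_i = (2i - 1) / (2N), i = 1..N, of N equal-probability bins."""
    return [(2 * i + 1) / (2 * n) for i in range(n)]


def huber(u, kappa):
    """Huber loss with threshold kappa; kappa = 0 falls back to |u|."""
    a = abs(u)
    if kappa == 0:
        return a
    return 0.5 * u * u if a <= kappa else kappa * (a - 0.5 * kappa)


def quantile_huber_loss(theta, targets, kappa=1.0):
    """Quantile (Huber) regression loss for one state-action pair.

    theta:   list of N predicted quantile locations.
    targets: list of M samples of the Bellman target r + gamma * z'.
    Sums over predicted quantiles and averages over target samples.
    """
    taus = quantile_midpoints(len(theta))
    total = 0.0
    for tau, th in zip(taus, theta):
        acc = 0.0
        for t in targets:
            u = t - th                                   # error of target vs. prediction
            indicator = 1.0 if u < 0 else 0.0
            acc += abs(tau - indicator) * huber(u, kappa)  # asymmetric weighting
        total += acc / len(targets)
    return total


def qrtd_update(theta, reward, gamma, z_next, alpha):
    """One tabular QRTD step for a single state.

    z_next is one sample drawn from the next state's quantile estimates
    (sampling itself is left to the caller in this sketch).
    """
    taus = quantile_midpoints(len(theta))
    target = reward + gamma * z_next
    return [th + alpha * (tau - (1.0 if target < th else 0.0))
            for tau, th in zip(taus, theta)]


def q_value(theta):
    """Action value as the mean of the N equally weighted quantile locations."""
    return sum(theta) / len(theta)
```

For example, `quantile_huber_loss([0.0], [1.0], kappa=0)` evaluates to 0.5, since the single midpoint is 0.5 and the error is 1. Greedy action selection would then compare `q_value` across the per-action quantile lists.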

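Experimental results

Research questions

RQ1: Is a distributional RL algorithm that can be trained end to end under the Wasserstein metric using quantile regression feasible?
RQ2: Does a quantile-based distribution representation without a projection step improve stability and performance over existing methods such as C51?
RQ3: Does QR-DQN achieve state-of-the-art performance on the Atari 2600 benchmark, and how does it compare with prior distributional methods?
RQ4: What are the theoretical convergence properties when the quantile projection is combined with the distributional Bellman operator?
RQ5: How does quantile regression, with or without Huber smoothing, affect the learning dynamics and final performance of distributional RL?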

Key findings

Method	Mean	Median	Games > human	Games > DQN
DQN	228%	79%	24	0
DDQN	307%	118%	33	43
Duel.	373%	151%	37	50
Prior.	434%	124%	39	48
Pr. Duel.	592%	172%	39	44
C51	701%	178%	40	50
QR-DQN-0	881%	199%	38	52
QR-DQN-1	915%	211%	41	54

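A quantile-based distributional RL algorithm that places uniform weights on fixed quantile fractions converges to a distributional fixed point under the Wasserstein distance.
The combined operator that includes the quantile projection is a contraction in the infinity-Wasserstein metric, which guarantees convergence.
QR-DQN outperforms prior methods, including C51, on the Atari 2600 benchmark, reaching higher mean and median human-normalized scores.
Using the quantile-Huber loss gives a further performance gain over the strict quantile loss.
Empirical results show that QRTD accurately minimizes the 1-Wasserstein distance to the true distribution in the windy gridworld setting.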
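
For reference, the contraction result and the QRTD update summarized above can be written compactly as follows. The notation is an assumption of this sketch, since the review does not fix symbols: \Pi_{W_1} denotes the quantile projection, \mathcal{T}^\pi the distributional Bellman operator, \bar d_\infty the maximal infinity-Wasserstein metric over states and actions, \gamma the discount factor, \alpha a step size, and \hat\tau_i the midpoints of N equal-probability bins.

```latex
% Contraction of the projected distributional Bellman operator
\bar d_\infty\!\left(\Pi_{W_1}\mathcal{T}^\pi Z_1,\; \Pi_{W_1}\mathcal{T}^\pi Z_2\right)
  \le \gamma\, \bar d_\infty(Z_1, Z_2)

% QRTD update for the i-th quantile location at state x
\theta_i(x) \leftarrow \theta_i(x)
  + \alpha\left(\hat\tau_i - \mathbb{1}\{\, r + \gamma z' < \theta_i(x) \,\}\right),
\qquad z' \sim Z_\theta(x'), \qquad \hat\tau_i = \frac{2i - 1}{2N}
```

Because \gamma < 1, repeated application of the projected operator shrinks the distance between any two value distributions, which is what yields a unique fixed point.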

This review was generated by AI and checked by a human editor.