QUICK REVIEW

[Paper Review] Fully Parameterized Quantile Function for Distributional Reinforcement Learning

Derek Yang, Zhao Li|arXiv (Cornell University)|Nov 5, 2019

Evolutionary Algorithms and Applications|Cited by 39

One-line summary

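This paper proposes the Fully Parameterized Quantile Function (FQF) for distributional reinforcement learning. Two networks jointly learn the quantile fractions and the corresponding quantile values, which approximates the return distribution more closely and yields strong results on Atari.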

ABSTRACT

Distributional Reinforcement Learning (RL) differs from traditional RL in that, rather than the expectation of total returns, it estimates distributions and has achieved state-of-the-art performance on Atari Games. The key challenge in practical distributional RL algorithms lies in how to parameterize estimated distributions so as to better approximate the true continuous distribution. Existing distributional RL algorithms parameterize either the probability side or the return value side of the distribution function, leaving the other side uniformly fixed as in C51, QR-DQN or randomly sampled as in IQN. In this paper, we propose fully parameterized quantile function that parameterizes both the quantile fraction axis (i.e., the x-axis) and the value axis (i.e., y-axis) for distributional RL. Our algorithm contains a fraction proposal network that generates a discrete set of quantile fractions and a quantile value network that gives corresponding quantile values. The two networks are jointly trained to find the best approximation of the true distribution. Experiments on 55 Atari Games show that our algorithm significantly outperforms existing distributional RL algorithms and creates a new record for the Atari Learning Environment for non-distributed agents.

Motivation and objectives

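Motivate distributional RL that approximates the full return distribution more closely instead of estimating only the mean return.
Propose a fully parameterized quantile function that learns both the quantile fractions and the corresponding quantile values.
Develop a training scheme for the fraction proposal network and the quantile value network that minimizes the Wasserstein distance.
Demonstrate state-of-the-art performance on 55 Atari games compared with existing distributional RL methods.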

Proposed method

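Define Z_{θ,τ}(x,a) as a mixture of Dirac deltas located at the learnable quantile values θ_i and weighted by the gaps between consecutive learnable quantile fractions τ_i (Eq. 1).
Define the 1-Wasserstein loss W1 between the true and the approximated quantile functions (Eq. 2), and show how to optimize τ to minimize this loss (Proposition 1 / Eqs. 4–5).
Use a fraction proposal network that generates a sorted set of quantile fractions τ for each state-action pair (Section 3.4).
Use a quantile value network that maps each fraction τ to a quantile value F^{-1}_{Z,w2}(τ) through an embedding φ(τ), which is combined with the state features by a Hadamard (element-wise) product.
Train the two networks jointly, using a quantile regression loss built on the Huber loss over pairs of quantile indices (Eq. 7).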
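
The steps above can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the function names are hypothetical, the softmax-plus-cumulative-sum construction of τ is an assumption about how sorted fractions can be produced, the quantile function is passed in as a plain Python callable instead of a trained network, and the normalization of the loss is a choice made here for readability. The gradient expression is the one obtained by differentiating the 1-Wasserstein loss with respect to an interior fraction τ_i, with quantile values evaluated at the midpoints τ̂_i = (τ_i + τ_{i+1}) / 2.

```python
import math


def propose_fractions(logits):
    """Turn N unnormalized scores into N+1 sorted fractions, tau_0 = 0 and tau_N = 1.

    Assumption: softmax followed by a cumulative sum, which keeps the
    fractions sorted by construction.
    """
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    taus = [0.0]
    for e in exps:
        taus.append(taus[-1] + e / total)
    taus[-1] = 1.0  # remove floating-point drift at the right end
    return taus


def midpoints(taus):
    """tau_hat_i = (tau_i + tau_{i+1}) / 2, where quantile values are evaluated."""
    return [(lo + hi) / 2.0 for lo, hi in zip(taus[:-1], taus[1:])]


def w1_fraction_gradient(quantile_fn, taus):
    """dW1/dtau_i for the interior fractions i = 1..N-1.

    Computes 2*F^{-1}(tau_i) - F^{-1}(tau_hat_i) - F^{-1}(tau_hat_{i-1}),
    valid for a non-decreasing quantile_fn.
    """
    hats = midpoints(taus)
    return [
        2.0 * quantile_fn(taus[i]) - quantile_fn(hats[i]) - quantile_fn(hats[i - 1])
        for i in range(1, len(taus) - 1)
    ]


def quantile_huber_loss(pred, target, tau_hats, kappa=1.0):
    """Quantile Huber loss over all (predicted i, target j) pairs.

    pred[i] is the predicted quantile value at tau_hats[i]; target[j] is a
    target quantile value. Averaging over targets is an assumption of this sketch.
    """
    total = 0.0
    for theta_i, tau in zip(pred, tau_hats):
        for t_j in target:
            delta = t_j - theta_i  # pairwise error
            abs_d = abs(delta)
            if abs_d <= kappa:
                huber = 0.5 * delta * delta
            else:
                huber = kappa * (abs_d - 0.5 * kappa)
            # asymmetric weight: |tau - 1{delta < 0}|
            weight = abs(tau - (1.0 if delta < 0 else 0.0))
            total += weight * huber / kappa
    return total / len(target)


# Equal logits give uniform fractions.
taus = propose_fractions([0.0, 0.0, 0.0, 0.0])  # [0.0, 0.25, 0.5, 0.75, 1.0]
# For a linear quantile function, uniform fractions are already stationary.
grad_linear = w1_fraction_gradient(lambda w: w, taus)  # [0.0, 0.0, 0.0]
# For a convex quantile function the gradient is negative at every interior fraction.
grad_convex = w1_fraction_gradient(lambda w: w * w, taus)  # [-0.03125, -0.03125, -0.03125]
```

In the convex case, a gradient-descent step raises every interior τ_i, which moves fractions toward the upper tail where the quantile function is steeper. That is the intuition behind letting the fractions adjust themselves instead of fixing or sampling them.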

Experimental results

Research questions

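RQ1: Can a fully parameterized quantile function, which learns both the quantile fractions and the corresponding values, reduce the Wasserstein distance to the true distribution more effectively than fixed or sampled fractions?
RQ2: Does self-adjusting fraction learning improve distribution approximation and learning speed over the earlier IQN and QR-DQN approaches?
RQ3: How does FQF perform against C51, QR-DQN, IQN, Rainbow, and other baselines on 55 Atari games?
RQ4: What is the practical trade-off between training speed and distribution approximation quality when a fraction proposal network is added?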

Key findings

Algorithm	Mean	Median	>Human	>DQN
DQN	221%	79%	24	0
PRIOR.	580%	124%	39	48
C51	701%	178%	40	50
RAINBOW	1213%	227%	42	52
QR-DQN	902%	193%	41	54
IQN	1112%	218%	39	54
FQF	1426%	272%	44	54

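FQF outperforms existing distributional RL methods on 55 Atari games and sets a new record for non-distributed agents.
In human-normalized score, FQF reaches a mean of 1426% and a median of 272%, exceeding the IQN, Rainbow, QR-DQN, C51, and DQN baselines.
Training curves show that, thanks to self-adjusting fractions, FQF generally learns faster than IQN on many games, although its wall-clock training speed is slightly lower overall because of the additional fraction proposal network.
A dedicated table reports marked improvements over multiple baselines, which illustrates the benefit of learning both the quantile fractions and the quantile values.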
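
For readers unfamiliar with the columns in the table, the sketch below shows how such summary statistics are commonly computed. It assumes the usual definition of the human-normalized score, where 0% corresponds to a random policy and 100% to the human reference. The review does not spell out the formula, so treat this as an assumption. The function names and the three games with their raw scores are made up for illustration and are not taken from the paper.

```python
from statistics import mean, median


def human_normalized(agent, random, human):
    """Human-normalized score in percent: 0% = random play, 100% = human reference."""
    return 100.0 * (agent - random) / (human - random)


def summarize(raw_scores):
    """raw_scores maps game -> (agent, random, human) raw scores.

    Returns the mean and median normalized score and the number of games
    where the agent exceeds the human reference (the ">Human" column).
    """
    normalized = [human_normalized(a, r, h) for a, r, h in raw_scores.values()]
    return {
        "mean": mean(normalized),
        "median": median(normalized),
        "above_human": sum(1 for s in normalized if s > 100.0),
    }


# Hypothetical raw scores for three imaginary games.
example = {
    "game_a": (150.0, 50.0, 100.0),  # 200% of human
    "game_b": (30.0, 10.0, 110.0),   # 20% of human
    "game_c": (500.0, 0.0, 250.0),   # 200% of human
}
summary = summarize(example)
# {'mean': 140.0, 'median': 200.0, 'above_human': 2}
```

The example also shows why both the mean and the median are reported: a few games with very high normalized scores can inflate the mean, while the median reflects typical performance across games.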

This review was generated by AI and checked by a human editor.