QUICK REVIEW

[Paper Review] A Distributional Perspective on Reinforcement Learning

Marc G. Bellemare, Will Dabney | arXiv (Cornell University) | Jul 21, 2017

Reinforcement Learning in Robotics | References: 38 | Citations: 241

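One-Sentence Summary

The paper argues for modeling the full distribution of returns (value distributions) in reinforcement learning. It introduces a distributional Bellman framework for policy evaluation together with a contraction result in the Wasserstein metric, analyzes instability in the control setting, and presents an algorithm that learns discrete distributions (categorical DQN), achieving strong results on Atari.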

ABSTRACT

In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning which models the expectation of this return, or value. Although there is an established body of literature studying the value distribution, thus far it has always been used for a specific purpose such as implementing risk-aware behaviour. We begin with theoretical results in both the policy evaluation and control settings, exposing a significant distributional instability in the latter. We then use the distributional perspective to design a new algorithm which applies Bellman's equation to the learning of approximate value distributions. We evaluate our algorithm using the suite of games from the Arcade Learning Environment. We obtain both state-of-the-art results and anecdotal evidence demonstrating the importance of the value distribution in approximate reinforcement learning. Finally, we combine theoretical and empirical evidence to highlight the ways in which the value distribution impacts learning in the approximate setting.

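Motivation and Objectives

Motivate a distributional view of reinforcement learning by focusing on the distribution of the return Z rather than only its expectation Q.
Characterize the theoretical behavior of the distributional Bellman operator in policy evaluation and in control.
Develop a practical algorithm that learns approximate value distributions and evaluate its empirical performance on Atari games.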

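Proposed Method

Define the value distribution Z under a policy and formulate the distributional Bellman equation.
Analyze the contraction property of the distributional Bellman operator T^π in policy evaluation using the Wasserstein metric.
Show that the distributional optimality operator is unstable in the control setting, including loss of the contraction property and problems with fixed points.
Propose a parametric discrete distribution (atoms on a fixed grid) to model Z, and implement a Bellman update that projects onto this support, which turns learning into multiclass classification.
Train categorical distributional DQN (C51) by minimizing the KL divergence between the projected Bellman update and the current distribution.
Evaluate on Atari 2600 games from the Arcade Learning Environment and compare against DQN-family baselines.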
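
The projection step and the KL-based loss described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `project_distribution` and `cross_entropy`, the argument layout, and the example values (51 atoms, support range [-10, 10], γ = 0.99) are assumptions chosen for illustration and are not stated in this review.

```python
import numpy as np

def project_distribution(next_probs, rewards, dones, gamma, v_min, v_max):
    """Project the shifted atoms r + gamma * z_j back onto the fixed grid.

    next_probs: (batch, n_atoms) probabilities of the next-state distribution
    rewards, dones: (batch,) arrays; dones is 1.0 where the episode ended
    (All names and shapes here are illustrative assumptions.)
    """
    batch, n_atoms = next_probs.shape
    delta_z = (v_max - v_min) / (n_atoms - 1)
    support = np.linspace(v_min, v_max, n_atoms)

    # Shift and shrink every atom, then clip to the support range.
    tz = rewards[:, None] + gamma * (1.0 - dones[:, None]) * support[None, :]
    tz = np.clip(tz, v_min, v_max)

    # Fractional grid position of each shifted atom and its two neighbours.
    b = np.clip((tz - v_min) / delta_z, 0, n_atoms - 1)
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    # Split each atom's mass between its neighbours by distance. When b
    # lands exactly on a grid point, all of the mass goes to that point.
    w_lower = np.where(lower == upper, 1.0, upper - b)
    w_upper = b - lower  # zero when lower == upper

    target = np.zeros_like(next_probs)
    rows = np.repeat(np.arange(batch), n_atoms)
    np.add.at(target, (rows, lower.ravel()), (next_probs * w_lower).ravel())
    np.add.at(target, (rows, upper.ravel()), (next_probs * w_upper).ravel())
    return target

def cross_entropy(target, pred_probs, eps=1e-8):
    # Cross-entropy term of KL(target || pred), averaged over the batch.
    # It differs from the KL divergence only by the entropy of the target,
    # which does not depend on the predicted distribution.
    return float(-(target * np.log(pred_probs + eps)).sum(axis=1).mean())

# Example: two transitions, a uniform next-state distribution over 51 atoms.
probs = np.full((2, 51), 1.0 / 51)
target = project_distribution(
    probs, np.array([1.0, 0.5]), np.array([0.0, 1.0]), 0.99, -10.0, 10.0
)
loss = cross_entropy(target, probs)
```

Because the target is a probability vector over a fixed set of atoms, the loss has the same form as in multiclass classification, which is the connection the method list draws. In a full agent, `next_probs` would typically be the target network's distribution for the greedy next action; that selection step is omitted here.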

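Experimental Results

Research Questions

RQ1: Does modeling the full value distribution offer theoretical and empirical advantages over learning only the expected value?
RQ2: Is the distributional Bellman operator a contraction in a suitable metric, in the policy evaluation setting, the control setting, or both?
RQ3: Can a practical approximation based on discretized distributions be learned effectively, and does it improve performance on complex tasks such as Atari?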

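Key Findings

In policy evaluation, the distributional Bellman operator is a γ-contraction in the maximal form of the Wasserstein metric, so repeated application converges to the true value distribution Z^π.
In the control setting, the distributional optimality operator is not a contraction in any metric between distributions and may have no fixed point, which points to instability under greedy updates.
Learning the full value distribution can preserve multimodality, which leads to more stable learning under function approximation and nonstationary policies.
The discrete parametric value distribution learned through projection (the categorical algorithm) outperforms DQN on several Atari games and achieves state-of-the-art results on some titles.
Increasing the number of atoms in the distribution generally improves performance, with large gains over DQN on multiple games.
The approach propagates rare rewards more effectively and improves performance on sparse-reward games.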
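
The contraction result in the first finding can be written out explicitly. The block below is a sketch in the usual notation of distributional reinforcement learning. The symbols R, X', A', P, d_p, and the barred metric do not appear in this review and are assumptions here: R is the random reward, (X', A') is the random next state-action pair, and d_p is the p-Wasserstein metric between distributions.

```latex
% Distributional Bellman equation under a fixed policy \pi
Z^{\pi}(x, a) \overset{D}{=} R(x, a) + \gamma Z^{\pi}(X', A'),
\qquad X' \sim P(\cdot \mid x, a),\; A' \sim \pi(\cdot \mid X')

% Maximal form of the p-Wasserstein metric d_p
\bar{d}_p(Z_1, Z_2) = \sup_{x, a} d_p\bigl(Z_1(x, a), Z_2(x, a)\bigr)

% \gamma-contraction of the policy evaluation operator T^{\pi}
\bar{d}_p\bigl(T^{\pi} Z_1, T^{\pi} Z_2\bigr) \le \gamma \, \bar{d}_p(Z_1, Z_2)
```

Since γ < 1, the last inequality says that each application of T^π shrinks the distance between any two value distributions by at least a factor of γ, which is why iterating the operator converges to Z^π. No comparable inequality holds for the optimality operator in the control setting.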

This review was generated by AI and checked by a human editor.