QUICK REVIEW

[Paper Review] Distributional Reinforcement Learning with Quantile Regression

Will Dabney, Mark Rowland|arXiv (Cornell University)|Oct 27, 2017

Sports Analytics and Performance | Cited by 150

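One-Sentence Summary

This paper shows how to learn value distributions end to end in reinforcement learning by applying quantile regression under the Wasserstein metric, proposes QR-DQN, and reports state-of-the-art results on Atari.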

ABSTRACT

In reinforcement learning an agent interacts with the environment by taking actions and observing the next state and reward. When sampled probabilistically, these state transitions, rewards, and actions can all induce randomness in the observed long-term return. Traditionally, reinforcement learning algorithms average over this randomness to estimate the value function. In this paper, we build on recent work advocating a distributional approach to reinforcement learning in which the distribution over returns is modeled explicitly instead of only estimating the mean. That is, we examine methods of learning the value distribution instead of the value function. We give results that close a number of gaps between the theoretical and algorithmic results given by Bellemare, Dabney, and Munos (2017). First, we extend existing results to the approximate distribution setting. Second, we present a novel distributional reinforcement learning algorithm consistent with our theoretical formulation. Finally, we evaluate this new algorithm on the Atari 2600 games, observing that it significantly outperforms many of the recent improvements on DQN, including the related distributional algorithm C51.

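Research Motivation and Goals

Model the full distribution of returns in reinforcement learning, not only its mean.
Close the gap between theory and practice by optimizing end to end under the Wasserstein metric.
Develop a practical algorithm (QR-DQN) that uses quantile regression to learn the value distribution.
Demonstrate a performance advantage over prior distributional methods on the Atari 2600 benchmark.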

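Proposed Method

Replace the C51-style parameterization, which places learned probabilities on fixed locations, with fixed uniform probabilities on learned locations, so that the model in effect estimates quantiles of the return distribution.
Use quantile regression to minimize the 1-Wasserstein distance between the target and predicted distributions, which permits unbiased stochastic gradient updates.
Prove that the distributional Bellman operator combined with the quantile projection is a contraction under the Wasserstein metric.
Derive and implement quantile regression temporal difference learning (QRTD) for policy evaluation and QR-DQN for control, optionally with the quantile Huber loss.
Adapt the DQN architecture to output N quantiles per action, and train it with the quantile regression loss instead of the standard temporal-difference loss.
Validate empirically on a gridworld-style task and on 57 Atari 2600 games, comparing against C51 and DQN variants.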
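
The loss and network output described in the list above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names (`quantile_midpoints`, `quantile_huber_loss`, `greedy_action`) are hypothetical, the quantile levels are assumed to be the bin midpoints (2i - 1)/(2N), and the Huber term is written in its common unscaled form with threshold `kappa`.

```python
import numpy as np

def quantile_midpoints(n):
    """Assumed quantile levels: midpoints (2i - 1) / (2N) of N equal-probability bins."""
    return (2.0 * np.arange(1, n + 1) - 1.0) / (2.0 * n)

def quantile_huber_loss(theta, targets, kappa=1.0):
    """Quantile Huber loss for one (state, action) pair.

    theta:   shape (N,), predicted quantile locations.
    targets: shape (M,), samples of the Bellman target r + gamma * theta'_j.
    kappa:   Huber threshold; kappa = 0 falls back to the plain quantile loss.
    """
    theta = np.asarray(theta, dtype=float)
    targets = np.asarray(targets, dtype=float)
    tau_hat = quantile_midpoints(theta.shape[0])
    # u[i, j] = target_j - theta_i (pairwise TD errors).
    u = targets[None, :] - theta[:, None]
    abs_u = np.abs(u)
    if kappa > 0:
        # Quadratic near zero, linear in the tails.
        huber = np.where(abs_u <= kappa, 0.5 * u ** 2, kappa * (abs_u - 0.5 * kappa))
    else:
        huber = abs_u
    # Asymmetric weight |tau_i - 1{u < 0}| penalizes over- and under-estimation differently.
    weight = np.abs(tau_hat[:, None] - (u < 0).astype(float))
    # Average over target samples j, then sum over quantiles i.
    return float((weight * huber).mean(axis=1).sum())

def greedy_action(quantiles):
    """quantiles: shape (num_actions, N). Q(x, a) is taken as the mean of the N quantiles."""
    return int(np.argmax(np.asarray(quantiles, dtype=float).mean(axis=1)))
```

In a full agent, `theta` would be one row of the network's (num_actions, N) output and `targets` would come from the target network's quantiles for the greedy next action; both of those pieces are omitted here.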

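Experimental Results

Research Questions

RQ1: Can a distributional RL algorithm be optimized end to end under the Wasserstein metric using quantile regression?
RQ2: Is there a quantile-based distribution representation that, without a projection step, outperforms existing methods such as C51 in stability and performance?
RQ3: Does QR-DQN achieve state-of-the-art performance on the Atari 2600 benchmark, and how does it compare with prior distributional methods?
RQ4: What contraction property holds, in theory, when the quantile projection is combined with the distributional Bellman operator?
RQ5: How does quantile regression, with or without Huber smoothing, affect the learning dynamics and final performance of distributional RL?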

Key Findings

Method	Mean	Median	> Human	> DQN
DQN	228%	79%	24	0
DDQN	307%	118%	33	43
Duel.	373%	151%	37	50
Prior.	434%	124%	39	48
Pr. Duel.	592%	172%	39	44
C51	701%	178%	40	50
QR-DQN-0	881%	199%	38	52
QR-DQN-1	915%	211%	41	54
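
To make the mean and median columns of the table easier to read, the sketch below assumes the usual Atari definition of a human-normalized score, 100 * (agent - random) / (human - random), and aggregates it across games. The helper names and the three sets of raw scores are made up for illustration and are not taken from the paper.

```python
from statistics import mean, median

def human_normalized_score(agent_score, random_score, human_score):
    """Agent score as a percentage of the gap between random play and human play."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

def summarize(scores):
    """Aggregate per-game normalized scores into mean, median, and games above human level."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "above_human": sum(s > 100.0 for s in scores),
    }

# Hypothetical (agent, random, human) raw scores for three made-up games.
games = [(75.0, 50.0, 100.0), (110.0, 50.0, 100.0), (550.0, 50.0, 100.0)]
scores = [human_normalized_score(*g) for g in games]
summary = summarize(scores)
```

A single game with a very large normalized score pulls the mean far above the median, which is one reason both statistics are reported side by side.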

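A quantile-based distributional RL algorithm that places uniform weights on a fixed set of quantile levels converges, in the Wasserstein metric, to a distributional fixed point.
The distributional Bellman operator combined with the quantile projection is a contraction in the ∞-Wasserstein metric, which guarantees convergence.
QR-DQN outperforms prior methods, including C51, on the Atari 2600 benchmark, with higher mean and median human-normalized scores.
Using the quantile Huber loss yields a further gain over the strict quantile loss (915% versus 881% mean, 211% versus 199% median).
Empirical results show that QRTD accurately minimizes the 1-Wasserstein distance to the ground-truth return distribution in the windy gridworld environment.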
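
The QRTD finding can be illustrated with a toy sketch. The code below is an assumption-laden example, not the paper's experiment: it uses a hypothetical two-state chain (state A moves to state B with reward 0, and B terminates with a reward drawn from Uniform(0, 1)) rather than the windy gridworld, and the function names, step size, and step count are chosen only for illustration. Each update nudges quantile i by alpha * (tau_i - 1{target < theta_i}), where the target bootstraps from a sampled quantile of the next state.

```python
import numpy as np

def quantile_midpoints(n):
    """Assumed quantile levels: midpoints (2i - 1) / (2N)."""
    return (2.0 * np.arange(1, n + 1) - 1.0) / (2.0 * n)

def qrtd_two_state_chain(n=4, gamma=0.9, alpha=0.002, steps=20_000, seed=0):
    """Run QRTD-style updates on the hypothetical chain A -> B -> terminal."""
    rng = np.random.default_rng(seed)
    tau_hat = quantile_midpoints(n)
    theta_a = np.zeros(n)  # quantile estimates for state A
    theta_b = np.zeros(n)  # quantile estimates for state B
    for _ in range(steps):
        # State B: the episode ends, so the target is just the sampled reward.
        r = rng.uniform(0.0, 1.0)
        theta_b += alpha * (tau_hat - (r < theta_b).astype(float))
        # State A: reward 0, bootstrap from a sample z' of the current estimate at B.
        z_next = theta_b[rng.integers(n)]
        target = 0.0 + gamma * z_next
        theta_a += alpha * (tau_hat - (target < theta_a).astype(float))
    return theta_a, theta_b

def w1_to_uniform(theta, scale, grid=10_000):
    """1-Wasserstein distance between the N-atom estimate and Uniform(0, scale).

    Computed as the mean absolute gap between the two inverse CDFs on a grid.
    """
    theta = np.sort(np.asarray(theta, dtype=float))
    omega = (np.arange(grid) + 0.5) / grid
    idx = np.minimum((omega * len(theta)).astype(int), len(theta) - 1)
    return float(np.mean(np.abs(theta[idx] - scale * omega)))

if __name__ == "__main__":
    theta_a, theta_b = qrtd_two_state_chain()
    print("quantiles at B:", np.round(theta_b, 3))
    print("quantiles at A:", np.round(theta_a, 3))
    print("W1 at A vs Uniform(0, gamma):", w1_to_uniform(theta_a, 0.9))
```

With N atoms, the smallest achievable distance to Uniform(0, c) in this measure is c / (4N), reached when the atoms sit at the scaled midpoints, so the printed distance should be read against that floor and not against zero.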

This review was generated by AI and reviewed by a human editor.