QUICK REVIEW

[Paper Review] Fully Parameterized Quantile Function for Distributional Reinforcement Learning

Derek Yang, Zhao Li|arXiv (Cornell University)|Nov 5, 2019

Evolutionary Algorithms and Applications | Cited by 39

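One-Sentence Summary

The paper introduces the Fully Parameterized Quantile Function (FQF) for distributional reinforcement learning: two networks jointly learn the quantile fractions and the quantile values to better approximate the return distribution, achieving strong results on Atari.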

ABSTRACT

Distributional Reinforcement Learning (RL) differs from traditional RL in that, rather than the expectation of total returns, it estimates distributions and has achieved state-of-the-art performance on Atari Games. The key challenge in practical distributional RL algorithms lies in how to parameterize estimated distributions so as to better approximate the true continuous distribution. Existing distributional RL algorithms parameterize either the probability side or the return value side of the distribution function, leaving the other side uniformly fixed as in C51, QR-DQN or randomly sampled as in IQN. In this paper, we propose fully parameterized quantile function that parameterizes both the quantile fraction axis (i.e., the x-axis) and the value axis (i.e., y-axis) for distributional RL. Our algorithm contains a fraction proposal network that generates a discrete set of quantile fractions and a quantile value network that gives corresponding quantile values. The two networks are jointly trained to find the best approximation of the true distribution. Experiments on 55 Atari Games show that our algorithm significantly outperforms existing distributional RL algorithms and creates a new record for the Atari Learning Environment for non-distributed agents.

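Motivation and Objectives

Push distributional RL beyond estimating only the mean return, toward a closer approximation of the full return distribution.
Propose a Fully Parameterized Quantile Function that learns both the quantile fractions and their corresponding values.
Develop a training scheme for the fraction proposal network and the quantile value network that minimizes the Wasserstein distance.
Demonstrate state-of-the-art performance on 55 Atari games compared with existing distributional RL methods.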

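Proposed Method

Introduce Z_{θ,τ}(x,a) as a mixture of N Diracs, with learned quantile values θ_i and quantile fractions τ_i (Eq. (1)).
Define the 1-Wasserstein loss W1 between the true quantile function and its approximation (Eq. (2)), and show how to optimize τ to minimize this loss (Proposition 1 / Eqs. (4)-(5)).
Use a fraction proposal network to generate sorted quantile fractions τ for each state-action pair (Section 3.4).
Use a quantile value network to map the embedded τ to quantile values F^{-1}_{Z,w2}(τ), combining the embedding φ(τ) with the state features through a Hadamard product.
Train the two networks jointly with a quantile regression loss built on the Huber loss over pairs of quantile indices (Eq. (7)).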
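
The pieces listed above can be sketched numerically. The NumPy example below is a minimal illustration, not the authors' implementation: it assumes a known toy quantile function in place of the learned quantile value network, takes the quantile values at the fraction midpoints τ̂_i = (τ_i + τ_{i+1})/2, and uses the fraction gradient 2F^{-1}(τ_i) − F^{-1}(τ̂_i) − F^{-1}(τ̂_{i−1}) in a plain gradient-descent loop. The function names, the toy quantile function `skewed`, the step size, and the loss reduction (sum over i, mean over j) are all assumptions made for illustration.

```python
import numpy as np

def midpoints(tau):
    """tau_hat_i = (tau_i + tau_{i+1}) / 2 for sorted fractions with tau_0 = 0, tau_N = 1."""
    return 0.5 * (tau[:-1] + tau[1:])

def wasserstein_1(tau, quantile_fn, cells=1000):
    """Midpoint-rule estimate of sum_i integral_{tau_i}^{tau_{i+1}} |F^{-1}(w) - theta_i| dw."""
    theta = quantile_fn(midpoints(tau))  # quantile values at the fraction midpoints
    total = 0.0
    for lo, hi, th in zip(tau[:-1], tau[1:], theta):
        width = (hi - lo) / cells
        w = lo + (np.arange(cells) + 0.5) * width
        total += np.sum(np.abs(quantile_fn(w) - th)) * width
    return float(total)

def fraction_gradient(tau, quantile_fn):
    """dW1/dtau_i = 2 F^{-1}(tau_i) - F^{-1}(tau_hat_i) - F^{-1}(tau_hat_{i-1}), i = 1..N-1."""
    theta = quantile_fn(midpoints(tau))
    return 2.0 * quantile_fn(tau[1:-1]) - theta[1:] - theta[:-1]

def quantile_huber_loss(td_errors, tau_hat, kappa=1.0):
    """td_errors[i, j] = target_j - theta_i, weighted by |tau_hat_i - 1{delta < 0}|."""
    abs_err = np.abs(td_errors)
    huber = np.where(abs_err <= kappa,
                     0.5 * td_errors ** 2,
                     kappa * (abs_err - 0.5 * kappa))
    weight = np.abs(tau_hat[:, None] - (td_errors < 0).astype(float))
    # Assumed reduction: sum over quantile index i, mean over target index j.
    return float(np.mean(np.sum(weight * huber / kappa, axis=0)))

def skewed(w):
    """Hypothetical toy quantile function, steep near w = 1."""
    return w ** 3

# Start from uniform fractions (the fixed choice) and adapt the interior ones.
uniform = np.linspace(0.0, 1.0, 5)  # N = 4
w1_before = wasserstein_1(uniform, skewed)

tau = uniform.copy()
for _ in range(100):
    tau[1:-1] -= 0.05 * fraction_gradient(tau, skewed)  # endpoints stay at 0 and 1
w1_after = wasserstein_1(tau, skewed)
```

In this toy setting the adapted fractions move toward the steep end of the quantile function, which is the intuition behind letting a network propose the fractions instead of fixing or sampling them. In the actual algorithm the true quantile function is unknown, so the gradient would be evaluated with the quantile value network's outputs rather than a closed-form function.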

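Experimental Results

Research Questions

RQ1: Is learning both the quantile fractions and their corresponding values (a fully parameterized quantile function) more effective at reducing the Wasserstein distance to the true distribution than using fixed or sampled fractions?
RQ2: Does adaptive fraction learning yield better distribution approximation and faster learning than the conventional IQN/QR-DQN approaches?
RQ3: How does FQF perform on 55 Atari games relative to C51, QR-DQN, IQN, Rainbow, and other baselines?
RQ4: What is the practical trade-off between training speed and distribution approximation quality when a fraction proposal network is added?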

Key Findings

Algorithm	Mean	Median	>Human	>DQN
DQN	221%	79%	24	0
PRIOR.	580%	124%	39	48
C51	701%	178%	40	50
RAINBOW	1213%	227%	42	52
QR-DQN	902%	193%	41	54
IQN	1112%	218%	39	54
FQF	1426%	272%	44	54

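FQF outperforms existing distributional RL methods on 55 Atari games and sets a new record for non-distributed agents.
On human-normalized scores, FQF reaches a mean of 1426% and a median of 272%, exceeding the IQN, Rainbow, QR-DQN, C51, and DQN baselines.
Training curves suggest that, thanks to adaptive fractions, FQF often learns faster than IQN on many games, although training is somewhat slower overall because of the additional fraction proposal network.
A dedicated table reports substantial gains over multiple baselines, supporting the effectiveness of learning both the quantile fractions and the quantile values.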
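
For readers unfamiliar with the Mean and Median columns, the sketch below shows how human-normalized scores are commonly computed in the Atari literature: 100 × (agent − random) / (human − random) per game, then aggregated across games. The game names and raw scores are hypothetical placeholders, not values from the paper, and the paper's exact evaluation protocol may differ in its details.

```python
import statistics

def human_normalized(agent, random_score, human_score):
    """Percent score: 0% matches a random policy, 100% matches the human reference."""
    return 100.0 * (agent - random_score) / (human_score - random_score)

# Hypothetical raw scores per game: (agent, random, human). Not from the paper.
raw_scores = {
    "game_a": (5200.0, 200.0, 1200.0),
    "game_b": (900.0, 100.0, 1100.0),
    "game_c": (30.0, 0.0, 20.0),
}

normalized = {
    name: human_normalized(*scores) for name, scores in raw_scores.items()
}
mean_score = statistics.mean(normalized.values())
median_score = statistics.median(normalized.values())
above_human = sum(score > 100.0 for score in normalized.values())
```

A few games with very large normalized scores can pull the mean far above the median, which is one reason results tables of this kind report both.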


This review was generated by AI and reviewed by human editors.