QUICK REVIEW

[Paper Review] Revisiting Fundamentals of Experience Replay

William Fedus, Prajit Ramachandran|arXiv (Cornell University)|Jul 13, 2020

Smart Grid Energy Management | Cited by 83

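One-Sentence Summary

This paper systematically analyzes experience replay in Q-learning. It shows that a larger replay capacity improves performance for some algorithms (notably Rainbow, which uses n-step returns), and that the replay ratio and the age of the data are key factors. It also finds that n-step returns are uniquely able to exploit a larger replay buffer, even in highly off-policy, offline-style settings.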

ABSTRACT

Experience replay is central to off-policy algorithms in deep reinforcement learning (RL), but there remain significant gaps in our understanding. We therefore present a systematic and extensive analysis of experience replay in Q-learning methods, focusing on two fundamental properties: the replay capacity and the ratio of learning updates to experience collected (replay ratio). Our additive and ablative studies upend conventional wisdom around experience replay -- greater capacity is found to substantially increase the performance of certain algorithms, while leaving others unaffected. Counterintuitively we show that theoretically ungrounded, uncorrected n-step returns are uniquely beneficial while other techniques confer limited benefit for sifting through larger memory. Separately, by directly controlling the replay ratio we contextualize previous observations in the literature and empirically measure its importance across a variety of deep RL algorithms. Finally, we conclude by testing a set of hypotheses on the nature of these performance benefits.

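Motivation and Objectives

Separate the effect of replay capacity from the effect of the age of the data in the replay buffer on learning performance.
Identify which algorithmic components allow an agent to benefit from a larger replay buffer.
Assess whether the findings generalize beyond Rainbow to other Q-learning variants such as DQN.
Investigate the mechanism that links n-step returns to replay capacity and off-policy data.
Explore the implications for offline/batch RL settings and a possible variance-reduction explanation.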

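Proposed Method

Define and measure the replay capacity (buffer size) and the age of the oldest policy (a measure of off-policyness).
Introduce the replay ratio, the number of gradient updates per environment transition, to decouple data collection from learning updates.
Run large-scale Atari experiments with Rainbow as the base agent, varying replay capacity and oldest-policy age over a grid.
Perform additive and ablative studies that add or remove components (PER, n-step returns, Adam, C51) to isolate each one's contribution to the gains from larger replay capacity.
Compare the online agents (DQN, Rainbow) against offline/batch RL settings to test the robustness of the findings.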
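The three quantities above are tied together by simple arithmetic. The sketch below assumes a first-in-first-out buffer that is already full and measures age in gradient updates, so the oldest-policy age is roughly the replay capacity multiplied by the replay ratio. The function names and the grid values are illustrative assumptions, not taken from the paper's code; the 1,000,000 / 0.25 pair is a commonly used DQN-style configuration (one update every four environment steps), used here only as an example.

```python
from itertools import product


def oldest_policy_age(replay_capacity: int, replay_ratio: float) -> float:
    """Age, in gradient updates, of the oldest transition in a full FIFO buffer.

    Assumption: the buffer is full and one transition is added per environment
    step, so the oldest transition has seen capacity * ratio updates.
    """
    return replay_capacity * replay_ratio


def replay_ratio_for(replay_capacity: int, oldest_policy: float) -> float:
    """Replay ratio that keeps the oldest-policy age fixed at a given capacity."""
    return oldest_policy / replay_capacity


def build_grid(capacities, oldest_policies):
    """Enumerate (capacity, oldest policy) pairs with the implied replay ratio."""
    return [
        {
            "replay_capacity": capacity,
            "oldest_policy": oldest,
            "replay_ratio": replay_ratio_for(capacity, oldest),
        }
        for capacity, oldest in product(capacities, oldest_policies)
    ]


if __name__ == "__main__":
    # Hypothetical grid values chosen for illustration only.
    for config in build_grid([100_000, 1_000_000, 10_000_000],
                             [25_000, 250_000, 2_500_000]):
        print(config)
```

Under these assumptions, growing the buffer at a fixed replay ratio makes the oldest data older, while growing it at a fixed oldest-policy age requires lowering the replay ratio. This is why the two columns of the results table are reported separately.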

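Experimental Results

Research Questions

RQ1: How do replay capacity and the age of transitions in the replay buffer independently affect learning performance?
RQ2: Which Rainbow components improve performance with a larger replay buffer, and are n-step returns solely responsible?
RQ3: Do the findings generalize to other Q-learning variants such as DQN, and do they hold in offline/batch RL settings?
RQ4: Which mechanisms (for example variance reduction or off-policyness) explain why n-step returns yield gains with larger replay?
RQ5: What are the practical implications for designing replay data generation in scalable off-policy deep RL agents?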

Key Findings

Agent	Fixed replay ratio improvement	Fixed oldest policy improvement
DQN	+0.1%	-0.4%
Rainbow	+28.7%	+18.3%

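When the oldest-policy age is allowed to grow, increasing replay capacity generally improves performance on Atari games.
Reducing the oldest-policy age (that is, making the replayed data more on-policy) also tends to improve performance, especially with larger buffers.
n-step returns are uniquely critical for benefiting from larger replay capacity; removing them prevents any gain from a larger buffer.
DQN does not benefit from a larger replay buffer, whereas Rainbow (with n-step returns) does, indicating that the effect depends on the agent's design.
In offline/batch RL, using n-step returns (n > 1) also improves performance on highly off-policy data, supporting the broader relevance of multi-step returns.
In the settings studied, prioritized experience replay (PER) does not significantly drive the gains from larger memory.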
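To make the term concrete, an uncorrected n-step target sums n discounted rewards observed along the stored trajectory and then bootstraps from a value estimate at the n-th next state, with no importance-sampling correction for the off-policy actions in between. The following is a minimal sketch under that standard definition; the function name and arguments are hypothetical, and `bootstrap_value` stands in for something like the maximum target-network Q-value at the n-th next state.

```python
from typing import Sequence


def n_step_target(
    rewards: Sequence[float],
    bootstrap_value: float,
    gamma: float,
    terminal: bool = False,
) -> float:
    """Uncorrected n-step target, where n = len(rewards).

    target = sum_{k=0}^{n-1} gamma^k * r_{t+k} + gamma^n * bootstrap_value

    If the episode ended within the n steps, the bootstrap term is dropped.
    No off-policy correction is applied to the intermediate rewards.
    """
    n = len(rewards)
    discounted_rewards = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if terminal:
        return discounted_rewards
    return discounted_rewards + (gamma ** n) * bootstrap_value


if __name__ == "__main__":
    # n = 1 recovers the usual one-step Q-learning target.
    print(n_step_target([1.0], bootstrap_value=2.0, gamma=0.5))
    # n = 3 uses three observed rewards before bootstrapping.
    print(n_step_target([1.0, 1.0, 1.0], bootstrap_value=8.0, gamma=0.5))
```

With n = 1 this reduces to the one-step target used by plain DQN, which is the setting that shows no gain from a larger buffer in the table above.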


This review was generated by AI and reviewed by human editors.