QUICK REVIEW

[Paper Review] Prioritized Sequence Experience Replay

Marc Brittain, Joshua R. Bertram|arXiv (Cornell University)|May 25, 2019

Reinforcement Learning in Robotics | References: 22 | Citations: 33

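One-Sentence Summary

This paper proposes Prioritized Sequence Experience Replay (PSER), an extension of PER that propagates priorities backward through sequences of transitions to accelerate learning, yielding faster convergence and better Atari performance than PER. The authors prove a convergence-speed advantage theoretically and report empirical gains on Blind Cliffwalk and Atari 2600.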

ABSTRACT

Experience replay is widely used in deep reinforcement learning algorithms and allows agents to remember and learn from experiences from the past. In an effort to learn more efficiently, researchers proposed prioritized experience replay (PER) which samples important transitions more frequently. In this paper, we propose Prioritized Sequence Experience Replay (PSER) a framework for prioritizing sequences of experience in an attempt to both learn more efficiently and to obtain better performance. We compare the performance of PER and PSER sampling techniques in a tabular Q-learning environment and in DQN on the Atari 2600 benchmark. We prove theoretically that PSER is guaranteed to converge faster than PER and empirically show PSER substantially improves upon PER.

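Motivation and Objectives

Advance data-efficient reinforcement learning by improving replay sampling to raise sample efficiency.
Extend PER to incorporate temporal sequence information, using backward priority decay to propagate the learning signal.
Provide theoretical convergence insights and validate them empirically in synthetic and benchmark environments.
Demonstrate practical improvements in data efficiency and final performance on Atari 2600 using DQN with PSER.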

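Proposed Method

Introduces Prioritized Sequence Experience Replay (PSER), which assigns each transition a priority based on its TD error and then decays and propagates that priority to the preceding transitions in the same episode.
Formalizes two decay schemes (MAX and ADD) that propagate priorities backward using a decay coefficient ρ and a window W.
Introduces a decay-protection parameter η that prevents priorities from collapsing and keeps the learning signal propagating.
Applies PSER on top of the DQN framework and compares it with PER on Blind Cliffwalk and the Atari 2600 benchmark.
Incorporates importance-sampling weights to correct sampling bias, controlled by a β parameter as in prior work.
Tunes PSER hyperparameters with coordinate descent on a subset of Atari games so that the reported results generalize.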
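
The priority propagation and bias correction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the default values of rho, window, eta, alpha and beta are illustrative only, the ADD rule is written as plain addition of the decayed priority, and decay protection is assumed to take the form max(|TD error| + eps, eta * old priority). The sampling probabilities and importance-sampling weights follow the standard PER definitions.

```python
def propagate_priority(priorities, n, rho=0.5, window=2, mode="max"):
    """Spread the priority at index n back over up to `window` earlier steps.

    `priorities` holds one episode's priorities in time order and is
    modified in place. rho and window are illustrative values only.
    """
    if mode not in ("max", "add"):
        raise ValueError("mode must be 'max' or 'add'")
    p_n = priorities[n]
    for k in range(1, window + 1):
        i = n - k
        if i < 0:  # reached the start of the episode
            break
        decayed = p_n * rho ** k
        if mode == "max":
            # MAX: never lower a priority that is already higher
            priorities[i] = max(priorities[i], decayed)
        else:
            # ADD (assumed form): accumulate the decayed priority
            priorities[i] += decayed
    return priorities


def protected_update(old_priority, td_error, eta=0.7, eps=1e-6):
    """Assumed decay protection: a priority cannot fall below eta * old value."""
    return max(abs(td_error) + eps, eta * old_priority)


def sampling_probs(priorities, alpha=0.6):
    """PER-style probabilities P(i) = p_i^alpha / sum_j p_j^alpha (p_i > 0)."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, beta=0.4):
    """PER-style weights (N * P(i))^(-beta), normalized by the largest weight."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    top = max(raw)
    return [w / top for w in raw]


# One episode whose last transition received a large TD error.
episode = [0.01, 0.01, 0.01, 1.0]
print(propagate_priority(episode, n=3, rho=0.5, window=2))
# -> [0.01, 0.25, 0.5, 1.0]
```

With window=2, only the two transitions immediately before the high-error step are raised; the first transition keeps its original priority. Passing the resulting priorities to `sampling_probs` and then `importance_weights` gives the most frequently sampled transitions the smallest weights, which is how the sampling bias is offset.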

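Experimental Results

Research Questions

RQ1: Does propagating TD-error-based priorities backward through the sequence of preceding transitions (PSER) yield faster convergence than standard PER?
RQ2: How do the decay scheme (MAX vs. ADD), the initial-priority strategy (MaxPrio vs. CurrentTD), and the η parameter affect the performance and stability of PSER?
RQ3: Does PSER deliver empirical gains over PER when used with DQN on a standard benchmark such as Atari 2600?
RQ4: What theoretical guarantees does PSER offer on convergence speed relative to PER?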

Key Findings

Sampling strategy	Median (human-normalized)	Mean (human-normalized)
PSER	109%	832%
PER	88%	607%

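PSER substantially improves on PER across the Atari 2600 benchmark games.
In the Blind Cliffwalk environment, PSER converges faster than PER because of its backward priority decay.
In the no-ops evaluation regime, PSER reaches a median human-normalized score of 109% and a mean of 832% across 55 Atari games, versus a median of 88% and a mean of 607% for PER.
Theoretical results show that, in the Blind Cliffwalk setting with decay coefficient ρ, PSER converges faster than PER.
Ablation studies indicate that MAX decay generally outperforms ADD decay in PSER.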
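
For readers unfamiliar with the metric, the sketch below shows how median and mean human-normalized scores are typically computed in the Atari literature, assuming the usual definition 100 × (agent − random) / (human − random). The function name and the raw scores are hypothetical placeholders chosen for easy arithmetic; they are not taken from the paper.

```python
from statistics import mean, median


def human_normalized(agent, random_play, human):
    """Agent score as a percentage of the gap between random and human play."""
    return 100.0 * (agent - random_play) / (human - random_play)


# Hypothetical (agent, random, human) raw scores; not results from the paper.
toy_scores = {
    "game_a": (150.0, 0.0, 100.0),
    "game_b": (50.0, 10.0, 110.0),
    "game_c": (900.0, 100.0, 200.0),
}

normalized = [human_normalized(*s) for s in toy_scores.values()]
print(median(normalized), mean(normalized))
# -> 150.0 330.0
```

In this toy example a single game with a very large normalized score pulls the mean far above the median, which is one plausible reason the reported means sit well above the reported medians.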

This review was generated by AI and reviewed by human editors.