QUICK REVIEW

[Paper Review] The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning

Audrūnas Gruslys, Will Dabney|arXiv (Cornell University)|Apr 15, 2017

Reinforcement Learning in Robotics | Cited by 28

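One-Sentence Summary

Reactor is a fast, sample-efficient actor-critic agent for reinforcement learning that combines Distributional Retrace for multi-step off-policy distributional learning, a novel $\beta$-LOO policy gradient that reduces variance, and a prioritized experience replay mechanism that exploits temporal locality. It reaches state-of-the-art performance on 57 Atari 2600 games in less than a day of training, with higher sample efficiency than Prioritized Dueling DQN and Categorical DQN and better run-time performance than A3C.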

ABSTRACT

In this work we present a new agent architecture, called Reactor, which combines multiple algorithmic and architectural contributions to produce an agent with higher sample-efficiency than Prioritized Dueling DQN (Wang et al., 2016) and Categorical DQN (Bellemare et al., 2017), while giving better run-time performance than A3C (Mnih et al., 2016). Our first contribution is a new policy evaluation algorithm called Distributional Retrace, which brings multi-step off-policy updates to the distributional reinforcement learning setting. The same approach can be used to convert several classes of multi-step policy evaluation algorithms designed for expected value evaluation into distributional ones. Next, we introduce the $\beta$-leave-one-out policy gradient algorithm which improves the trade-off between variance and bias by using action values as a baseline. Our final algorithmic contribution is a new prioritized replay algorithm for sequences, which exploits the temporal locality of neighboring observations for more efficient replay prioritization. Using the Atari 2600 benchmarks, we show that each of these innovations contributes to both the sample efficiency and final agent performance. Finally, we demonstrate that Reactor reaches state-of-the-art performance after 200 million frames and less than a day of training.
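
For reference, the $\beta$-leave-one-out estimator named in the abstract can be written roughly as below. This is a sketch reconstructed from the Reactor paper, not text from this review. The notation is introduced here as an assumption: $\hat{a}$ is the action sampled from the behavior policy $\mu$ in state $x$, $R(\hat{a})$ is a return estimate for that action, $Q$ is the learned action-value function, and $c$ is a truncation constant.

$$
G_{\beta\text{-LOO}} = \beta\,\bigl(R(\hat{a}) - Q(x,\hat{a})\bigr)\,\nabla\pi(\hat{a}\mid x) + \sum_{a} Q(x,a)\,\nabla\pi(a\mid x),
\qquad
\beta = \min\!\Bigl(c,\ \tfrac{1}{\mu(\hat{a}\mid x)}\Bigr)
$$

With $\beta = 1/\mu(\hat{a}\mid x)$ and an unbiased $R(\hat{a})$, the first term corrects the error of $Q$ in expectation, so the estimate is unbiased even when $Q$ is inaccurate. With $\beta = 0$ only the sum over all actions remains, which has low variance but inherits any bias in $Q$. Truncating $\beta$ at $c$ sits between these two extremes.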

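Research Motivation and Goals

Develop a reinforcement learning agent that achieves both high sample efficiency and low wall-clock training time.
Integrate off-policy, multi-step, distributional learning into a deep actor-critic framework.
Improve policy gradient estimation by using action-value estimates as a baseline to reduce variance.
Design a novel prioritized experience replay mechanism that exploits the temporal locality of transitions within sequences.
Reach state-of-the-art performance on the Atari 2600 benchmark with minimal training time and sample complexity.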

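Proposed Method

Introduces Distributional Retrace($\lambda$), a multi-step off-policy algorithm for distributional reinforcement learning that extends Retrace to learning value distributions.
Introduces the $\beta$-LOO (leave-one-out) policy gradient, which uses action-value estimates as a baseline to reduce the variance of policy gradient estimates.
Develops a context-aware prioritized experience replay mechanism that prioritizes transitions by temporal proximity and return estimates to improve sample efficiency.
Uses a deep neural network architecture with separate heads for value and advantage estimation, combined with a target network and experience replay.
Trains multiple agents asynchronously with a parameter server to achieve high training throughput and low wall-clock training time.
Applies Retrace for off-policy return estimation, so that experience collected by a behavior policy different from the target policy still yields stable training.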
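
To make the off-policy correction concrete, here is a minimal NumPy sketch of the expected-value Retrace($\lambda$) target that Distributional Retrace builds on. It is not the paper's distributional variant and not the authors' code. The function name, argument layout, and default values are assumptions chosen for illustration. The trace coefficient $c_t = \lambda \min(1, \pi(a_t \mid x_t) / \mu(a_t \mid x_t))$ follows the standard Retrace definition.

```python
import numpy as np


def retrace_targets(q, pi, mu, actions, rewards, bootstrap, gamma=0.99, lam=1.0):
    """Expected-value Retrace(lambda) targets for a single trajectory.

    q:         [T, A] current estimates Q(x_t, a)
    pi:        [T, A] target-policy probabilities pi(a | x_t)
    mu:        [T]    behavior-policy probability of the action actually taken
    actions:   [T]    integer actions a_t chosen by the behavior policy
    rewards:   [T]    rewards r_t
    bootstrap: scalar estimate of E_pi Q(x_T, .) after the last step (0 if terminal)
    """
    q = np.asarray(q, dtype=float)
    pi = np.asarray(pi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    actions = np.asarray(actions)
    T = len(rewards)
    idx = np.arange(T)

    q_taken = q[idx, actions]                  # Q(x_t, a_t)
    v = (pi * q).sum(axis=1)                   # E_pi Q(x_t, .)
    # Truncated importance weights: never larger than lam, so traces cannot explode.
    c = lam * np.minimum(1.0, pi[idx, actions] / mu)

    targets = np.empty(T)
    next_v = bootstrap       # E_pi Q(x_{t+1}, .)
    next_correction = 0.0    # c_{t+1} * (target_{t+1} - Q(x_{t+1}, a_{t+1}))
    for t in reversed(range(T)):
        targets[t] = rewards[t] + gamma * (next_v + next_correction)
        next_v = v[t]
        next_correction = c[t] * (targets[t] - q_taken[t])
    return targets
```

When the data are on-policy with a deterministic policy and `lam=1.0`, the targets reduce to discounted Monte Carlo returns. With `lam=0.0` they reduce to one-step expected-SARSA targets. Off-policy data fall between the two, with traces cut wherever the target policy assigns the taken action lower probability than the behavior policy did.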

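Experimental Results

Research Questions

RQ1: Can a distributional reinforcement learning agent achieve both high sample efficiency and low wall-clock training time?
RQ2: How does using action-value estimates as a baseline in policy gradient estimation affect variance and performance?
RQ3: To what extent does exploiting temporal locality in replay prioritization improve sample efficiency in sequential decision-making tasks?
RQ4: Can a hybrid actor-critic architecture that combines off-policy learning with distributional returns surpass existing state-of-the-art agents on Atari 2600?
RQ5: How much does each component of the Reactor architecture contribute to final performance, in terms of sample efficiency and training speed?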

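Key Findings

After 200 million frames of training, Reactor reaches a mean human-normalized score of 1.65 and a mean rank of 4.58 across 57 Atari 2600 games, outperforming all prior methods, including Rainbow and A3C.
After 500 million frames and four days of training, Reactor reaches a mean human-normalized score of 1.82 and a mean rank of 3.65, surpassing even Rainbow in the no-op starts setting.
When evaluated with random human starts, the distributional version of Reactor generalizes better than its non-distributional version, indicating greater robustness.
The $\beta$-LOO policy gradient significantly outperforms the TISLR baseline in both final performance and stability.
Prioritized experience replay is the most influential component, but every component (Distributional Retrace, $\beta$-LOO, and contextual replay) contributes significantly to sample efficiency and final performance.
Reactor reaches state-of-the-art performance in less than a day of training, a significant gain in wall-clock efficiency over prior methods such as DQN and Rainbow.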
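
The human-normalized scores quoted above are typically computed per game as (agent - random) / (human - random) and then aggregated across games. The sketch below illustrates that arithmetic under the assumption that the paper follows this common Atari convention. The game names and raw scores are made up for illustration and are not results from the paper.

```python
from statistics import mean, median


def human_normalized(agent, random, human):
    """Return 0.0 for random-level play and 1.0 for human-level play."""
    return (agent - random) / (human - random)


# Hypothetical raw scores, NOT taken from the paper.
raw_scores = {
    "game_a": {"agent": 150.0, "random": 50.0, "human": 100.0},
    "game_b": {"agent": 30.0, "random": 10.0, "human": 50.0},
    "game_c": {"agent": 90.0, "random": 10.0, "human": 90.0},
}

normalized = {game: human_normalized(**s) for game, s in raw_scores.items()}
mean_score = mean(normalized.values())
median_score = median(normalized.values())
```

A score above 1.0 on a game means the agent beats the human reference there. The mean is sensitive to a few games with very large scores, which is why Atari results often report the median as well.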

This review was generated by AI and reviewed by a human editor.