QUICK REVIEW

[Paper Review] Efficient Parallel Methods for Deep Reinforcement Learning

Alfredo Vicente Clemente, Humberto Nicolás Castejón|arXiv (Cornell University)|May 13, 2017

Reinforcement Learning in Robotics | References: 5 | Cited by: 80

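One-sentence summary

PAAC introduces a GPU-oriented, synchronous, multi-actor parallel framework that learns a policy from hundreds of actors on a single machine and reaches state-of-the-art results on Atari within a few hours. It compares favorably with Gorila, A3C, and GA3C across several games.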

ABSTRACT

We propose a novel framework for efficient parallelization of deep reinforcement learning algorithms, enabling these algorithms to learn from multiple actors on a single machine. The framework is algorithm agnostic and can be applied to on-policy, off-policy, value based and policy gradient based algorithms. Given its inherent parallelism, the framework can be efficiently implemented on a GPU, allowing the usage of powerful models while significantly reducing training time. We demonstrate the effectiveness of our framework by implementing an advantage actor-critic algorithm on a GPU, using on-policy experiences and employing synchronous updates. Our algorithm achieves state-of-the-art performance on the Atari domain after only a few hours of training. Our framework thus opens the door for much faster experimentation on demanding problem domains. Our implementation is open-source and is made public at https://github.com/alfredvc/paac

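Motivation and objectives

Motivate and enable efficient parallelization of deep reinforcement learning on a single machine.
Develop an algorithm-agnostic framework that can handle on-policy, off-policy, value-based, and policy-gradient methods.
Show that synchronous updates from a large number of actors yield fast learning and strong performance.
Provide an open-source implementation to speed up experimentation on demanding problem domains.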

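Proposed method

Propose a general parallel framework with n_e environments and n_w workers that collects experience and applies batched updates to a single set of neural network parameters.
Use synchronous, batched updates to avoid the stale-gradient problem common in asynchronous methods.
Present Parallel Advantage Actor-Critic (PAAC), an n-step A2C-style algorithm in which the policy and value networks share parameters.
In PAAC, compute the policy and value gradients on a mini-batch of size n_e * t_max and update the weights synchronously.
Run experiments with two network architectures (arch_nips and arch_nature) to compare the effect of model size on learning, training on Atari 2600 with TensorFlow on a GPU.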
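To make the n_e * t_max batch shape concrete, the sketch below computes n-step bootstrapped returns and advantages for n_e environments over t_max steps, then flattens them into a single batch. It is a minimal illustration in plain Python, not the authors' TensorFlow implementation; the function names, the discount factor gamma, and the toy numbers are assumptions made for this example.

```python
def n_step_returns(rewards, dones, bootstrap_values, gamma=0.99):
    """Compute n-step returns for every environment.

    rewards[t][e] and dones[t][e] are indexed by step t and environment e.
    bootstrap_values[e] is the critic's estimate V(s_{t_max}) for environment e.
    """
    t_max = len(rewards)
    n_e = len(bootstrap_values)
    returns = [[0.0] * n_e for _ in range(t_max)]
    running = list(bootstrap_values)
    # Walk backwards in time so each return reuses the one after it.
    for t in reversed(range(t_max)):
        for e in range(n_e):
            # A terminal step cuts off bootstrapping from later states.
            not_done = 0.0 if dones[t][e] else 1.0
            running[e] = rewards[t][e] + gamma * running[e] * not_done
            returns[t][e] = running[e]
    return returns


def advantages(returns, values):
    """Advantage estimate: n-step return minus the critic's value V(s_t)."""
    return [[r - v for r, v in zip(r_row, v_row)]
            for r_row, v_row in zip(returns, values)]


def flatten_batch(per_step):
    """Merge the (t_max, n_e) layout into one batch of n_e * t_max samples."""
    return [x for step in per_step for x in step]


# Toy rollout: n_e = 3 environments, t_max = 2 steps (hypothetical numbers).
rewards = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
dones = [[False, False, False], [False, True, False]]
bootstrap = [4.0, 4.0, 0.0]
values = [[1.0, 0.5, 1.0], [1.0, 1.0, 0.0]]

rets = n_step_returns(rewards, dones, bootstrap, gamma=0.5)
batch = flatten_batch(advantages(rets, values))
print(len(batch))  # one gradient step would consume all n_e * t_max samples
```

Because every environment contributes its t_max steps to the same batch, a single gradient step updates the one shared parameter set, which is what keeps the update synchronous.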

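Experimental results

Research questions

RQ1: Can a highly parallel single-machine framework efficiently support on-policy, off-policy, value-based, and policy-gradient RL algorithms?
RQ2: Can synchronous multi-actor training on a GPU reach state-of-the-art performance on Atari while substantially reducing training time relative to previous parallel methods?
RQ3: How do different network architectures and numbers of actors affect learning speed and stability in a parallel RL setting?
RQ4: When scaling the number of parallel actors, what are the trade-offs between environment interaction time and learning time?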

Key findings

Game	Gorila	A3C FF	GA3C	PAAC arch_nips	PAAC arch_nature
Amidar	1189.70	263.9	218	701.8	1348.3
Centipede	8432.30	3755.8	7386	5747.32	7368.1
Beam Rider	3302.9	22707.9	N/A	4062.0	6844.0
Boxing	94.9	59.8	92	99.6	99.8
Breakout	402.2	681.9	N/A	470.1	565.3
Ms. Pacman	3233.50	653.7	1978	2194.7	1976.0
Name This Game	6182.16	10476.1	5643	9743.7	14068.0
Pong	18.3	5.6	18	20.6	20.9
Qbert	10815.6	15148.8	14966.0	16561.7	17249.2
Seaquest	13169.06	2355.4	1706	1754.0	1755.3
Space Invaders	1883.4	15730.5	N/A	1077.3	1427.8
Up n Down	12561.58	74705.7	8623	88105.3	100523.3

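PAAC reaches state-of-the-art performance on the Atari 2600 domain after only a few hours of training on a single machine.
In the reported results, PAAC outperforms Gorila on 8 of the 12 games and A3C FF on 8 of the 12 games.
PAAC matches GA3C on most of the tested games and exceeds it on several, as shown in Table 1.
Increasing the number of environments n_e shortens training time (a given number of timesteps is reached sooner) while keeping scores competitive, but at very high n_e training can diverge if the learning rate is not scaled appropriately.
The framework is truly on-policy: it keeps a single copy of the parameters and updates it synchronously, which reduces the problems caused by stale gradients and asynchrony.
Experiments show that the framework trains both architectures (arch_nips and arch_nature) on a GPU and speeds up Atari training substantially, from days to hours.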
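The win counts quoted in the findings can be checked directly against the results table. The short script below does this under one stated assumption: PAAC "outperforms" a baseline on a game when the better of its two architectures scores strictly higher. The helper name paac_wins is hypothetical, and the scores are copied from the table, with None standing in for N/A.

```python
COLUMNS = ("Gorila", "A3C FF", "GA3C", "PAAC arch_nips", "PAAC arch_nature")

# Scores copied from the results table; None marks an N/A entry.
SCORES = {
    "Amidar": (1189.70, 263.9, 218, 701.8, 1348.3),
    "Centipede": (8432.30, 3755.8, 7386, 5747.32, 7368.1),
    "Beam Rider": (3302.9, 22707.9, None, 4062.0, 6844.0),
    "Boxing": (94.9, 59.8, 92, 99.6, 99.8),
    "Breakout": (402.2, 681.9, None, 470.1, 565.3),
    "Ms. Pacman": (3233.50, 653.7, 1978, 2194.7, 1976.0),
    "Name This Game": (6182.16, 10476.1, 5643, 9743.7, 14068.0),
    "Pong": (18.3, 5.6, 18, 20.6, 20.9),
    "Qbert": (10815.6, 15148.8, 14966.0, 16561.7, 17249.2),
    "Seaquest": (13169.06, 2355.4, 1706, 1754.0, 1755.3),
    "Space Invaders": (1883.4, 15730.5, None, 1077.3, 1427.8),
    "Up n Down": (12561.58, 74705.7, 8623, 88105.3, 100523.3),
}


def paac_wins(baseline):
    """Return (wins, games_compared) for best-of-two PAAC vs. a baseline."""
    idx = COLUMNS.index(baseline)
    wins = compared = 0
    for row in SCORES.values():
        base = row[idx]
        if base is None:  # skip games with no reported baseline score
            continue
        compared += 1
        if max(row[3], row[4]) > base:
            wins += 1
    return wins, compared


for name in ("Gorila", "A3C FF", "GA3C"):
    print(name, paac_wins(name))
# Gorila (8, 12)
# A3C FF (8, 12)
# GA3C (8, 9)
```

Under this reading the table supports the 8-of-12 figures for Gorila and A3C FF. GA3C has scores for only nine of the games, and PAAC trails it only on Centipede, by a narrow margin.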


This review was generated by AI and reviewed by a human editor.