QUICK REVIEW

[Paper Review] rlpyt: A Research Code Base for Deep Reinforcement Learning in PyTorch

Adam Stooke, Pieter Abbeel | arXiv (Cornell University) | Sep 3, 2019

Reinforcement Learning in Robotics | References: 33 | Cited by: 52

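One-Sentence Summary

rlpyt provides a modular PyTorch code base that implements the three major families of deep reinforcement learning algorithms (policy gradients, DQN variants, and Q-value policy gradients) on shared, high-throughput infrastructure with a range of sampling and optimization configurations. It emphasizes single-node parallelism, reproducibility, and practical tooling for small- to medium-scale RL research.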

ABSTRACT

Since the recent advent of deep reinforcement learning for game play and simulated robotic control, a multitude of new algorithms have flourished. Most are model-free algorithms which can be categorized into three families: deep Q-learning, policy gradients, and Q-value policy gradients. These have developed along separate lines of research, such that few, if any, code bases incorporate all three kinds. Yet these algorithms share a great depth of common deep reinforcement learning machinery. We are pleased to share rlpyt, which implements all three algorithm families on top of a shared, optimized infrastructure, in a single repository. It contains modular implementations of many common deep RL algorithms in Python using PyTorch, a leading deep learning library. rlpyt is designed as a high-throughput code base for small- to medium-scale research in deep RL. This white paper summarizes its features, algorithms implemented, and relation to prior work, and concludes with detailed implementation and usage notes. rlpyt is available at https://github.com/astooke/rlpyt.

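Motivation and Goals

Promote a shared, high-throughput infrastructure that unifies the three families of deep RL algorithms in one framework.
Provide modular, reusable PyTorch implementations of common RL algorithms.
Enable flexible experimentation through serial, parallel, and asynchronous configurations for sampling and optimization.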

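Proposed Approach

Implement three algorithm families on shared infrastructure: policy gradients (A2C, PPO), DQN and its variants (Double, Dueling, Categorical, Rainbow, R2D2-like), and Q-value policy gradients (DDPG, TD3, SAC).
Support replay buffers with n-step returns, sequence replay, prioritized replay, and frame-based buffers for memory efficiency.
Offer several sampling configurations (Serial, Parallel-CPU, Parallel-GPU, Alternating-GPU), along with synchronous and asynchronous optimization using PyTorch DistributedDataParallel (NCCL/gloo).
Introduce the namedarraytuple data structure, which organizes collections of arrays with flexible leading dimensions and multi-modal data.
Maintain compatibility with OpenAI Gym, with wrappers for env_info and spaces, plus launch utilities for running many experiments on local hardware.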
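
To make the n-step return idea from the replay-buffer item concrete, here is a minimal pure-Python sketch. It is not rlpyt's buffer code: the function names `n_step_returns` and `n_step_target`, and the use of plain lists in place of arrays, are assumptions made for illustration. For each start index t it accumulates the discounted reward sum over n steps, stops at the first episode boundary, and records whether the window ended so that bootstrapping can be skipped.

```python
def n_step_returns(rewards, dones, n, gamma):
    """Discounted n-step reward sums for every start index with a full window.

    returns[t] = sum_{k=0}^{n-1} gamma**k * rewards[t + k], truncated at the
    first step where dones[t + k] is True.
    done_n[t] is True when an episode ended inside the window.
    (Hypothetical helper for illustration, not part of rlpyt.)
    """
    returns, done_n = [], []
    for t in range(len(rewards) - n + 1):
        g, discount, ended = 0.0, 1.0, False
        for k in range(n):
            g += discount * rewards[t + k]
            if dones[t + k]:
                # Episode ended here: later rewards belong to a new episode.
                ended = True
                break
            discount *= gamma
        returns.append(g)
        done_n.append(ended)
    return returns, done_n


def n_step_target(g, ended, bootstrap_value, n, gamma):
    """n-step TD target: add the discounted bootstrap value unless the episode ended."""
    return g if ended else g + (gamma ** n) * bootstrap_value


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0, 1.0]
    dones = [False, False, True, False]
    returns, done_n = n_step_returns(rewards, dones, n=2, gamma=0.5)
    for t, (g, ended) in enumerate(zip(returns, done_n)):
        print(t, g, ended, n_step_target(g, ended, bootstrap_value=2.0, n=2, gamma=0.5))
```

Storing the reward sum and the window-level done flag together is one way a replay buffer can serve n-step targets without recomputing them per sample; the bootstrap value (for example, a target network's estimate at step t + n) is supplied by the algorithm.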

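Experimental Results

Research Questions

RQ1: Can a single, modular code base efficiently support multiple deep RL algorithm families at high throughput on single-machine hardware?
RQ2: How do different sampling and optimization configurations (serial, parallel CPU/GPU, asynchronous) affect throughput and learning performance on common RL benchmarks?
RQ3: Do practical data structures such as namedarraytuple improve data organization and extensibility across diverse RL algorithms?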

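Key Findings

rlpyt reproduces competitive learning curves on Atari and MuJoCo environments from a single code base.
An R2D2-like recurrent replay setup is demonstrated in a non-distributed setting, reaching high sampling throughput.
Asynchronous and parallel sampling modes can improve hardware utilization and enable multi-GPU training within a single workstation.
A new data structure, namedarraytuple, manages multi-modal observations and batched data without flattening.
The framework recommends serial mode for debugging, with parallel configurations adopted incrementally as needed.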
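
The namedarraytuple idea in the findings above can be illustrated with a small standard-library sketch. This is not rlpyt's implementation or API: `index_fields`, `Observation`, and `Samples` are hypothetical names, and nested Python lists stand in for NumPy or PyTorch arrays, so only leading-dimension indexing is shown. The point is that one index or slice is applied to every field of a nested, named structure, and the result keeps that structure.

```python
from collections import namedtuple


def is_namedtuple(x):
    """True for instances of classes created by collections.namedtuple."""
    return isinstance(x, tuple) and hasattr(x, "_fields")


def index_fields(struct, index):
    """Apply `index` to the leading dimension of every leaf field.

    Nested namedtuples are traversed recursively and rebuilt with the same
    type, so the result has the same field names as the input. A None leaf
    is treated as an absent field and passed through unchanged.
    (Hypothetical helper for illustration, not part of rlpyt.)
    """
    if is_namedtuple(struct):
        return type(struct)(*(index_fields(field, index) for field in struct))
    if struct is None:
        return None
    return struct[index]


# A multi-modal observation nested inside a batch of samples (3 time steps).
Observation = namedtuple("Observation", ["image", "joint_angles"])
Samples = namedtuple("Samples", ["observation", "action", "reward"])

samples = Samples(
    observation=Observation(
        image=[[0, 1], [2, 3], [4, 5]],
        joint_angles=[[0.1], [0.2], [0.3]],
    ),
    action=[0, 1, 0],
    reward=[0.0, 1.0, 0.5],
)

step = index_fields(samples, 1)              # a single time step
window = index_fields(samples, slice(0, 2))  # the first two time steps

if __name__ == "__main__":
    print(step)
    print(window)
```

In this sketch the indexing is a free function; a structure that supports the same operation through ordinary bracket syntax lets sampler and replay code stay unchanged when an environment adds or removes observation fields.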

This review was generated by AI and checked by a human editor.