QUICK REVIEW

[Paper Review] Distributed Distributional Deterministic Policy Gradients

Gabriel Barth-Maron, Matthew W. Hoffman|arXiv (Cornell University)|Apr 23, 2018

Reinforcement Learning in Robotics | References: 20 | Cited by: 283

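One-Sentence Summary

The paper introduces D4PG, a distributed off-policy actor-critic algorithm with a distributional critic and N-step returns, which achieves state-of-the-art performance on a diverse set of continuous control tasks.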

ABSTRACT

This work adopts the very successful distributional perspective on reinforcement learning and adapts it to the continuous control setting. We combine this within a distributed framework for off-policy learning in order to develop what we call the Distributed Distributional Deep Deterministic Policy Gradient algorithm, D4PG. We also combine this technique with a number of additional, simple improvements such as the use of $N$-step returns and prioritized experience replay. Experimentally we examine the contribution of each of these individual components, and show how they interact, as well as their combined contributions. Our results show that across a wide variety of simple control tasks, difficult manipulation tasks, and a set of hard obstacle-based locomotion tasks the D4PG algorithm achieves state of the art performance.

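Motivation and Objectives

Adopt the distributional perspective on the critic in the continuous control setting.
Develop a distributed off-policy learning framework to speed up data collection.
Integrate N-step returns and prioritized experience replay to improve learning.
Systematically ablate each component to understand its contribution and its interactions with the others.
Demonstrate state-of-the-art performance on control, manipulation, and parkour tasks.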

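Proposed Method

Use a distributional critic (a categorical distribution) to model the distribution of returns rather than only its expectation.
Extend DDPG with distributional Bellman updates and actor-critic gradients.
Incorporate N-step returns into the distributional update.
Distribute experience collection across K parallel actors that write to a shared replay table.
Apply prioritized experience replay with importance sampling in the distributed setting.
Use the Ape-X framework to manage the parallel actors and replay-based learning.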
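
The categorical critic and the N-step distributional update listed above can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function name `project_categorical`, its argument names, and the support bounds `v_min` and `v_max` are assumptions chosen for illustration. The sketch shifts each atom of the target critic's distribution by the accumulated N-step reward, scales it by gamma^N, and redistributes the probability mass onto the two nearest atoms of the fixed support.

```python
import numpy as np

def project_categorical(rewards_n, discount_n, next_probs, v_min, v_max):
    """Project the shifted atoms r + gamma^N * z_i back onto a fixed support.

    rewards_n:  (B,) accumulated discounted N-step rewards (hypothetical name)
    discount_n: (B,) gamma^N, or 0.0 where the episode ended within N steps
    next_probs: (B, L) target-critic probabilities at the bootstrap state
    """
    batch, num_atoms = next_probs.shape
    atoms = np.linspace(v_min, v_max, num_atoms)
    delta = (v_max - v_min) / (num_atoms - 1)

    # Apply the Bellman shift to every atom and clip to the support range.
    tz = np.clip(rewards_n[:, None] + discount_n[:, None] * atoms[None, :],
                 v_min, v_max)

    # Fractional position of each shifted atom on the support grid.
    b = (tz - v_min) / delta
    lower = np.floor(b).astype(int)
    upper = np.minimum(lower + 1, num_atoms - 1)
    w_upper = b - lower          # share of mass sent to the upper neighbour
    w_lower = 1.0 - w_upper      # share of mass sent to the lower neighbour

    # Scatter-add the mass; np.add.at accumulates repeated indices correctly.
    rows = np.broadcast_to(np.arange(batch)[:, None], lower.shape)
    projected = np.zeros_like(next_probs, dtype=float)
    np.add.at(projected, (rows, lower), next_probs * w_lower)
    np.add.at(projected, (rows, upper), next_probs * w_upper)
    return projected
```

The projected distribution would then serve as the target in a cross-entropy loss against the online critic's output. The support bounds and the number of atoms are hyperparameters that this summary does not specify.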

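Experimental Results

Research Questions

RQ1: How does a distributional critic affect learning stability and performance in continuous control?
RQ2: What is the effect of combining distributional updates, distributed actors, N-step returns, and prioritized replay?
RQ3: Across standard control, manipulation, and parkour tasks, which components contribute most to the performance gains?
RQ4: Is prioritized replay beneficial in the presence of distributional updates and distributed data collection?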

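Key Findings

Distributional updates improve performance, especially on harder tasks such as the humanoid tasks and the manipulation domains.
N-step returns provide the largest relative gain among the proposed improvements.
The full D4PG algorithm achieves state-of-the-art performance on standard control, manipulation, and parkour tasks.
Prioritized experience replay yields only limited gains for D4PG and is sometimes redundant.
An unroll length of N=5 consistently outperforms N=1, and N=1 is unstable on some tasks.
Distributed actors combined with a shared replay table substantially reduce wall-clock training time.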
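
To make the N=1 versus N=5 comparison concrete, here is a minimal sketch of how an N-step bootstrap target is accumulated from a reward sequence. The function name `n_step_target` and the sample numbers are illustrative assumptions, not values from the paper.

```python
def n_step_target(rewards, bootstrap_value, gamma, n):
    """Return sum_{k<n} gamma^k * r_k + gamma^n * bootstrap_value."""
    if len(rewards) < n:
        raise ValueError("need at least n rewards")
    total = 0.0
    for k in range(n):
        total += (gamma ** k) * rewards[k]
    # The value estimate at step n is discounted by gamma^n.
    return total + (gamma ** n) * bootstrap_value


# Illustrative numbers only: five unit rewards and a bootstrap estimate of 8.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
one_step = n_step_target(rewards, bootstrap_value=8.0, gamma=0.5, n=1)
five_step = n_step_target(rewards, bootstrap_value=8.0, gamma=0.5, n=5)
```

With N=5 the target relies more on observed rewards and less on the bootstrapped estimate, because that estimate is weighted by gamma^5 instead of gamma.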

This review was generated by AI and reviewed by a human editor.