QUICK REVIEW

[Paper Review] An empirical investigation of the challenges of real-world reinforcement learning

Gabriel Dulac-Arnold, Nir Levine|arXiv (Cornell University)|Mar 24, 2020

Reinforcement Learning in Robotics | References: 133 | Cited by: 52

ONE-SENTENCE SUMMARY

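This paper formalizes nine real-world reinforcement learning challenges, analyzes their impact on state-of-the-art agents using the realworldrl-suite, and proposes an open-source benchmark for evaluating them.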

ABSTRACT

Reinforcement learning (RL) has proven its worth in a series of artificial domains, and is beginning to show some successes in real-world scenarios. However, many of the research advances in RL are hard to leverage in real-world systems due to a series of assumptions that are rarely satisfied in practice. In this work, we identify and formalize a series of independent challenges that embody the difficulties that must be addressed for RL to be commonly deployed in real-world systems. For each challenge, we define it formally in the context of a Markov Decision Process, analyze the effects of the challenge on state-of-the-art learning algorithms, and present some existing attempts at tackling it. We believe that an approach that addresses our set of proposed challenges would be readily deployable in a large number of real world problems. Our proposed challenges are implemented in a suite of continuous control environments called the realworldrl-suite which we propose as an open-source benchmark.

MOTIVATION AND GOALS

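Identify and define real-world RL challenges within the Markov Decision Process (MDP) framework, together with the intuition behind each one.
Give a formal definition of each challenge and analyze its effect on learning algorithms.
Develop a benchmark suite (realworldrl-suite) that extends the DeepMind Control Suite for studying these challenges.
Evaluate state-of-the-art agents (DMPO and D4PG) on each challenge to establish baselines.
Provide guidelines and resources that make RL testing in real-world-like simulated environments reproducible.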

PROPOSED APPROACH

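Formally define nine real-world RL challenges within the MDP framework.
Implement the challenge environments in the realworldrl-suite by extending the DeepMind Control Suite with perturbations.
Benchmark two state-of-the-art agents (DMPO and D4PG) on several tasks of varying difficulty.
Introduce pre-convergence regret and post-convergence instability metrics to evaluate sample efficiency and stability.
Calibrate a subset of the challenges and combine them into a single combined benchmark task for baseline comparison.
Release open-source code and documentation for reproducing the experiments.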
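To make the idea of a perturbed environment concrete, here is a minimal Python sketch of one way a physical parameter (for example, a pole length) could drift between episodes as a clipped random walk. The class name `RandomWalkPerturbation` and its parameters are hypothetical illustrations, not the realworldrl-suite API, and the suite's own perturbation schedules may be defined differently.

```python
import random
from dataclasses import dataclass


@dataclass
class RandomWalkPerturbation:
    """Hypothetical per-episode perturbation of one physical parameter.

    The parameter starts at `start`, takes a Gaussian step with standard
    deviation `std` at each episode boundary, and is clipped to [low, high].
    """

    start: float
    low: float
    high: float
    std: float
    seed: int = 0

    def __post_init__(self) -> None:
        if not self.low <= self.start <= self.high:
            raise ValueError("start must lie inside [low, high]")
        self._rng = random.Random(self.seed)  # seeded for reproducible runs
        self.value = self.start

    def next_episode(self) -> float:
        """Advance the walk by one episode and return the new parameter value."""
        step = self._rng.gauss(0.0, self.std)
        # Clip so the simulated system stays within a plausible range.
        self.value = min(self.high, max(self.low, self.value + step))
        return self.value
```

For example, calling `next_episode()` once per episode and applying the returned value to the simulator before reset would give the agent a slowly shifting system; bounding the walk keeps the dynamics within a physically plausible range.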
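The pre-convergence regret and post-convergence instability metrics can be illustrated with a small Python sketch. The definitions below are an assumption for illustration only, not the paper's exact formulas: convergence is taken as the first point after which every return stays within a relative tolerance `tol` of the final level (the mean of the last `tail` returns, assumed positive), regret is the summed gap to a reference `optimal` return before that point, and instability is the standard deviation of the returns after it. The function names are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def convergence_index(returns: Sequence[float], tol: float = 0.05, tail: int = 10) -> int:
    """First index after which every return stays at or above (1 - tol) * final level.

    The final level is the mean of the last `tail` returns (assumed positive).
    Returns len(returns) if even the last point is below the threshold.
    """
    final = statistics.fmean(returns[-tail:])
    threshold = (1.0 - tol) * final
    idx = len(returns)
    # Walk backwards while the curve is still above the threshold.
    while idx > 0 and returns[idx - 1] >= threshold:
        idx -= 1
    return idx


def regret_and_instability(
    returns: Sequence[float], optimal: float, tol: float = 0.05, tail: int = 10
) -> Tuple[float, float]:
    """Hypothetical (pre-convergence regret, post-convergence instability)."""
    t_conv = convergence_index(returns, tol, tail)
    # Regret: accumulated shortfall relative to the reference return before convergence.
    regret = sum(optimal - r for r in returns[:t_conv])
    # Instability: spread of the returns once the curve has converged.
    after = returns[t_conv:]
    instability = statistics.pstdev(after) if len(after) > 1 else 0.0
    return regret, instability
```

With `returns=[10, 40, 70, 96, 100, 98, 100]`, `optimal=100` and `tail=3`, the curve converges at index 3, giving a regret of 180 and an instability of about 1.66. Under this framing, a lower regret corresponds to better sample efficiency and a lower instability to steadier behavior after learning.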

EXPERIMENTAL RESULTS

Research Questions

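RQ1: How does each real-world challenge affect learning performance and sample efficiency in RL?
RQ2: How do DMPO and D4PG compare under these real-world challenges?
RQ3: What is the effect of combining the challenges into a single benchmark task?
RQ4: Which challenges most affect stability and convergence on continuous control tasks?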

Key Findings

DMPO shows higher pre-convergence regret than D4PG on all tasks.
D4PG generally shows higher sample efficiency and, in many cases, converges more stably than DMPO.
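Adding delays to actions, observations, or rewards degrades performance, with action and observation delays being especially harmful.
Adding high-dimensional or noisy dummy state dimensions can slow convergence, although on some tasks the learner still approaches optimal performance.
On a combined real-world challenge benchmark, state-of-the-art agents can fail quickly under mild perturbations, highlighting the need for more robust methods.
The paper provides an open-source benchmark (realworldrl-suite) to standardize evaluation on these challenges.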
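Action and observation delays can be modeled with a fixed-length FIFO queue: with a delay of k steps, the action executed at step t is the one the agent chose at step t - k, and likewise for observations. The minimal Python sketch below assumes that the first k steps use a fill value such as a zero action; the class name `DelayBuffer` is hypothetical and not part of the realworldrl-suite.

```python
from collections import deque
from typing import Deque, Generic, TypeVar

T = TypeVar("T")


class DelayBuffer(Generic[T]):
    """Fixed-length FIFO that returns the item pushed `delay` steps earlier.

    Until `delay` real items have been pushed, it returns `fill`
    (for example a zero action or the initial observation).
    """

    def __init__(self, delay: int, fill: T) -> None:
        if delay < 0:
            raise ValueError("delay must be non-negative")
        # Pre-load the queue so the first `delay` pushes return the fill value.
        self._queue: Deque[T] = deque([fill] * delay)

    def push(self, item: T) -> T:
        """Enqueue the newest item and return the one that is now due."""
        self._queue.append(item)
        return self._queue.popleft()
```

In a control loop, one buffer would sit between the agent and the actuators (`executed = action_delay.push(agent_action)`) and another between the sensors and the agent (`seen = obs_delay.push(true_obs)`); a reward delay could use the same structure. A delay of 0 makes the buffer a pass-through.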
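The dummy-dimension challenge can be sketched as padding each observation with task-irrelevant features, either constant or noisy. The function `add_dummy_dims` below is a hypothetical illustration assuming zero-valued or independent Gaussian extra features; the suite's own implementation may generate them differently.

```python
import random
from typing import List, Optional, Sequence


def add_dummy_dims(
    obs: Sequence[float],
    num_dummy: int,
    noise_std: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Append `num_dummy` task-irrelevant features to an observation vector.

    With noise_std == 0 the extra features are constant zeros; otherwise each
    one is an independent Gaussian sample with that standard deviation.
    """
    if num_dummy < 0:
        raise ValueError("num_dummy must be non-negative")
    if rng is None:
        rng = random.Random()
    if noise_std == 0.0:
        extra = [0.0] * num_dummy
    else:
        extra = [rng.gauss(0.0, noise_std) for _ in range(num_dummy)]
    # The original features are kept unchanged at the front of the vector.
    return list(obs) + extra
```

Because the extra features carry no information about the task, the learner has to learn to ignore them, which is one plausible reason convergence can slow as `num_dummy` grows.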

This review was generated by AI and checked by human editors.