QUICK REVIEW

[Paper Review] Deep Reinforcement Learning that Matters

Peter Henderson, Riashat Islam|arXiv (Cornell University)|Sep 19, 2017

Evolutionary Algorithms and Applications|Cited by 364

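One-Sentence Summary

This paper studies reproducibility, experimental practice, and reporting in deep reinforcement learning, focusing on policy gradient methods, and proposes guidelines for improving rigor and comparability.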

ABSTRACT

In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.

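Motivation and Goals

Assess the sources of variability that affect the reproducibility of deep RL experiments.
Assess how hyperparameters, architectures, random seeds, and environments affect results.
Assess how different codebases and implementation details affect baselines.
Propose guidelines and statistical practices that improve reproducibility and fair comparison.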

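Proposed Method

Survey and experimentally analyze the factors that affect reproducibility in policy gradient methods for continuous control.
Vary hyperparameters, network architecture, reward scaling, random seeds, and environments in controlled experiments.
Compare baseline implementations of TRPO, PPO, DDPG, and ACKTR from several codebases (e.g. OpenAI Baselines) on MuJoCo tasks.
Report the mean and standard error across multiple random seeds; discuss significance testing and bootstrap methods.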
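The reporting practice in the last item can be sketched as follows. This is a minimal illustration, not the paper's own code: the function names `mean_and_stderr` and `bootstrap_ci`, the percentile form of the bootstrap, and the five return values are all assumptions made for the example.

```python
import random
import statistics


def mean_and_stderr(returns):
    """Sample mean and standard error of the mean across seeds."""
    n = len(returns)
    mean = statistics.fmean(returns)
    stderr = statistics.stdev(returns) / n ** 0.5
    return mean, stderr


def bootstrap_ci(returns, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(returns)
    # Resample the per-seed returns with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(returns, k=n)) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical final returns from five random seeds (made-up numbers).
final_returns = [3120.0, 2875.0, 3410.0, 1990.0, 3055.0]

mean, stderr = mean_and_stderr(final_returns)
ci_low, ci_high = bootstrap_ci(final_returns)
print(f"mean={mean:.1f} stderr={stderr:.1f} 95% CI=({ci_low:.1f}, {ci_high:.1f})")
```

With only a handful of seeds the interval is typically wide, which is one reason a bare mean can be hard to interpret.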

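Experimental Results

Research Questions

RQ1: How do hyperparameters affect baseline performance across algorithms and environments?
RQ2: How do the choices of network architecture and activation function affect learning outcomes?
RQ3: How do random seeds, the number of trials, and environment stochasticity affect reported results?
RQ4: To what extent do different codebases change baseline performance?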

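Key Findings

Hyperparameters can have large and inconsistent effects across algorithms and environments.
Network architecture and activation function significantly affect performance and interact with the chosen algorithm.
Random seeds and the number of trials can cause large variance in performance; averaging over seeds without a proper statistical framework can be misleading.
Environment properties (stable versus unstable dynamics) strongly affect algorithm performance and can change which method performs best.
Implementation details in different codebases can produce significant performance differences, which underscores the need to report all details and share code.
Significance testing and bootstrap analysis give meaningful insight into whether an observed gain is reliable.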
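One way to check whether a gap between two sets of runs is more than seed noise is sketched below. It uses a two-sided permutation test on the difference in mean return, chosen here only because it needs nothing beyond the standard library; it stands in for the significance tests mentioned above and is not taken from the paper. The function name `permutation_test` and all return values are hypothetical.

```python
import random
import statistics


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means.

    Returns an approximate p-value: the share of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(
            statistics.fmean(pooled[: len(a)]) - statistics.fmean(pooled[len(a):])
        )
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (count + 1) / (n_permutations + 1)


# Hypothetical final returns for two algorithms, five seeds each.
algo_a = [3120.0, 2875.0, 3410.0, 1990.0, 3055.0]
algo_b = [2650.0, 3300.0, 2100.0, 2980.0, 2440.0]

p_value = permutation_test(algo_a, algo_b)
print(f"p-value for the difference in mean return: {p_value:.3f}")
```

The same function can be applied to two disjoint groups of seeds from a single algorithm and configuration; a small p-value there would point to seed variance rather than a real improvement.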


This review was generated by AI and checked by a human editor.