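[Paper Explained] Deep Reinforcement Learning at the Edge of the Statistical Precipice
The paper argues that deep reinforcement learning evaluations based on only a few runs carry high statistical uncertainty, and it proposes robust, scalable tools (interval estimates, performance profiles, IQM) for comparing algorithms reliably.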
Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field's confidence in reported results with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied with an open-source library rliable, to prevent unreliable results from stagnating the field.
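Research Motivation and Goals
- Highlight the role of statistical uncertainty in few-run deep RL evaluation.
- Show how point estimates can lead to misleading conclusions on RL benchmarks.
- Propose practical tools and metrics for quantifying and comparing performance when only a limited number of runs is available.
- Recommend an evaluation methodology and an open-source tool for robust reporting.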
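Proposed Method
- Advocate reporting interval estimates via stratified bootstrap confidence intervals.
- Introduce performance profiles and run-score distributions to visualize variability across tasks.
- Recommend robust aggregate metrics such as the interquartile mean (IQM) and the optimality gap.
- Propose comparing algorithms using the average probability of improvement.
- Demonstrate the approach on the Atari 100k, ALE, Procgen, and DeepMind Control Suite benchmarks.
- Provide rliable, an open-source library implementing these tools.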
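The aggregate metrics and interval estimates listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the rliable API: the function names, the `(num_runs, num_tasks)` layout of normalized scores, the threshold `gamma=1.0`, and the defaults `reps=2000` and `alpha=0.05` are all assumptions made for the example. The IQM here uses simple truncation (drop the lowest and highest quarter of the pooled scores), and the interval is a percentile bootstrap in which runs are resampled independently within each task.

```python
import numpy as np


def iqm(scores: np.ndarray) -> float:
    """Interquartile mean: mean of the middle 50% of all pooled run-task scores."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = flat.size // 4  # number of scores dropped at each end
    return float(flat[cut:flat.size - cut].mean())


def optimality_gap(scores: np.ndarray, gamma: float = 1.0) -> float:
    """Average amount by which scores fall short of the threshold gamma."""
    scores = np.asarray(scores, dtype=float)
    return float(np.maximum(gamma - scores, 0.0).mean())


def stratified_bootstrap_ci(scores, metric, reps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling runs with replacement within each task."""
    scores = np.asarray(scores, dtype=float)
    num_runs, num_tasks = scores.shape
    rng = np.random.default_rng(seed)
    task_idx = np.arange(num_tasks)
    stats = np.empty(reps)
    for b in range(reps):
        # One independent set of resampled run indices per task (the strata).
        run_idx = rng.integers(0, num_runs, size=(num_runs, num_tasks))
        stats[b] = metric(scores[run_idx, task_idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

A call such as `stratified_bootstrap_ci(scores, iqm)` would then return the lower and upper ends of a 95% interval for the IQM. Stratifying by task keeps every task represented in each resample, which matters when only a handful of runs per task is available.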
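Experimental Results
Research Questions
- RQ1: How does statistical uncertainty affect reported deep RL performance when only a few training runs are feasible?
- RQ2: Do interval estimates and robust metrics provide reliable cross-task comparisons on common RL benchmarks?
- RQ3: Do performance profiles and score distributions give a more informative picture than conventional mean/median reporting?
- RQ4: What changes to evaluation protocols are needed to ensure fair, reproducible comparisons between methods?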
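To make the notion of a performance profile in RQ3 concrete, the sketch below computes a run-score distribution: for each threshold tau, the fraction of all runs, pooled across tasks, whose normalized score exceeds tau. The function name, the `(num_runs, num_tasks)` score layout, and the choice of thresholds are assumptions made for illustration, and this is not the rliable API.

```python
import numpy as np


def run_score_distribution(scores, taus):
    """Fraction of pooled run-task scores strictly above each threshold tau."""
    flat = np.asarray(scores, dtype=float).ravel()
    taus = np.asarray(taus, dtype=float)
    # Compare every score with every threshold, then average over scores.
    return (flat[None, :] > taus[:, None]).mean(axis=1)


# Toy data: 2 runs on 2 tasks, normalized scores.
scores = np.array([[0.2, 1.2],
                   [0.8, 0.4]])
profile = run_score_distribution(scores, taus=[0.0, 0.5, 1.0])
print(profile)  # [1.   0.5  0.25]
```

Plotting such a curve for each algorithm shows the whole distribution of scores rather than a single aggregate; if one curve lies above another at every threshold, the comparison does not depend on which aggregate is chosen.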
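Main Findings
- Point estimates (mean/median) vary substantially and can produce incorrect rankings in the few-run regime.
- The sample median is biased, and its uncertainty remains high with few runs; additional runs can overturn earlier conclusions.
- Percentile-based stratified bootstrap confidence intervals provide reliable uncertainty estimates for small samples.
- IQM tends to give narrower confidence intervals than the median while remaining far more robust to outliers than the mean.
- Performance profiles and score distributions reveal variability across tasks that can change the perceived ranking.
- On the benchmarks examined (ALE / Atari 200M, Procgen, DeepMind Control Suite), many claimed improvements do not hold up once uncertainty and cross-task variability are taken into account.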
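One way to check whether a claimed improvement survives run-to-run variability is the average probability of improvement mentioned among the proposed methods. The sketch below estimates, per task, the probability that a run of algorithm X scores higher than a run of algorithm Y, and then averages across tasks. It assumes ties count as half a win (as in the Mann-Whitney U statistic) and that both score arrays have shape `(num_runs, num_tasks)` with the same tasks in the same order; the function name is hypothetical, and this is not the rliable API.

```python
import numpy as np


def probability_of_improvement(x, y):
    """Average over tasks of P(X > Y), estimated from all run pairs per task."""
    x = np.asarray(x, dtype=float)  # shape (runs_x, num_tasks)
    y = np.asarray(y, dtype=float)  # shape (runs_y, num_tasks)
    # Pairwise differences between every X run and every Y run on each task.
    diff = x[:, None, :] - y[None, :, :]
    wins = (diff > 0) + 0.5 * (diff == 0)  # ties count as half a win
    per_task = wins.mean(axis=(0, 1))
    return float(per_task.mean())


x = np.array([[1.0, 0.0], [2.0, 0.0]])
y = np.array([[0.0, 0.0], [0.0, 1.0]])
print(probability_of_improvement(x, y))  # 0.625
```

A value near 0.5 suggests that neither algorithm reliably beats the other; in practice the estimate would be reported together with a bootstrap interval rather than on its own.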
This explanation was generated by AI and reviewed by a human editor.