[Paper Review] Deep Reinforcement Learning at the Edge of the Statistical Precipice
The paper argues that few-run deep RL evaluations suffer from high statistical uncertainty and proposes robust, scalable methods (interval estimates, performance profiles, IQM) to reliably compare algorithms.
Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field's confidence in reported results with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied with an open-source library rliable, to prevent unreliable results from stagnating the field.
Motivation & Objective
- Highlight the role of statistical uncertainty in few-run deep RL evaluations.
- Showcase how point estimates can mislead conclusions on RL benchmarks.
- Propose practical tools and metrics to quantify and compare performance under limited runs.
- Recommend an evaluation methodology and open-source tool for robust reporting.
Proposed method
- Advocate reporting interval estimates via stratified bootstrap confidence intervals.
- Introduce performance profiles and run-score distributions to visualize variability across tasks.
- Recommend robust aggregate metrics like interquartile mean (IQM) and optimality gap.
- Propose using average probability of improvement to compare algorithms.
- Demonstrate the methodology on Atari 100k, ALE, Procgen, and DeepMind Control Suite benchmarks.
- Provide open-source library rliable for implementing these tools.
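The aggregate metrics and interval estimates listed above can be illustrated with a short NumPy sketch. It is not the rliable implementation: the function names, the `(num_runs, num_tasks)` score layout, the default of 2000 resamples, and the threshold `gamma=1.0` (human-level performance on normalized scores) are assumptions made for this example.

```python
import numpy as np


def iqm(scores):
    """Interquartile mean: mean of the middle 50% of all run-task scores."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * flat.size)  # number of scores dropped at each end
    return flat[cut:flat.size - cut].mean()


def optimality_gap(scores, gamma=1.0):
    """Average amount by which scores fall short of the threshold gamma."""
    scores = np.asarray(scores, dtype=float)
    return gamma - np.minimum(scores, gamma).mean()


def stratified_bootstrap_ci(scores, aggregate, reps=2000, alpha=0.05, seed=0):
    """Percentile CI for `aggregate`, resampling runs independently per task.

    `scores` is assumed to have shape (num_runs, num_tasks).
    """
    scores = np.asarray(scores, dtype=float)
    num_runs, num_tasks = scores.shape
    rng = np.random.default_rng(seed)
    task_idx = np.arange(num_tasks)
    stats = np.empty(reps)
    for i in range(reps):
        # Each task (stratum) draws its own runs with replacement.
        run_idx = rng.integers(0, num_runs, size=(num_runs, num_tasks))
        stats[i] = aggregate(scores[run_idx, task_idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lower, upper


if __name__ == "__main__":
    # Synthetic scores for illustration only: 5 runs on 26 tasks.
    demo = np.random.default_rng(1).gamma(2.0, 0.3, size=(5, 26))
    print("IQM:", iqm(demo))
    print("95% CI:", stratified_bootstrap_ci(demo, iqm))
```

Resampling within each task keeps every task represented in every bootstrap sample, which is why the bootstrap is described as stratified.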
Experimental results
Research questions
- RQ1: How does statistical uncertainty affect reported deep RL performance when only a few training runs are feasible?
- RQ2: Can interval estimates and robust metrics provide reliable comparisons across tasks in common RL benchmarks?
- RQ3: Do performance profiles and score distributions offer a more informative picture than traditional mean/median reporting?
- RQ4: What evaluation protocol changes are necessary to ensure fair, reproducible comparisons across methods?
Key findings
- Point estimates (mean/median) show substantial variability and can misrank algorithms in few-run regimes.
- Sample medians are biased estimators, and with few runs their uncertainty stays high, so conclusions based on them may be overturned once more runs are added.
- Stratified bootstrap confidence intervals, computed with the percentile method, provide reliable uncertainty estimates when the number of runs N is small.
- IQM often yields smaller confidence intervals and is more robust to outliers than the median.
- Performance profiles and score distributions reveal across-task variability and can change perceived rankings.
- Across benchmarks (ALE at 200M frames, Procgen, DeepMind Control Suite), many claimed improvements do not hold once uncertainty and across-task variability are taken into account.
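The findings on performance profiles and pairwise comparisons can be made concrete with a minimal NumPy sketch. It assumes normalized scores stored as `(num_runs, num_tasks)` arrays and estimates the per-task probability of improvement as the fraction of run pairs in which one algorithm beats the other, counting ties as one half. The function names are hypothetical and this is not the rliable API.

```python
import numpy as np


def performance_profile(scores, taus):
    """Fraction of all run-task scores strictly above each threshold tau."""
    scores = np.asarray(scores, dtype=float)
    taus = np.asarray(taus, dtype=float)
    # Compare every score with every threshold, then average over scores.
    return (scores.ravel()[None, :] > taus[:, None]).mean(axis=1)


def probability_of_improvement(scores_x, scores_y):
    """Average over tasks of P(X > Y), counting ties as one half.

    Both inputs are assumed to have shape (num_runs, num_tasks) with the
    same task ordering; the number of runs may differ.
    """
    x = np.asarray(scores_x, dtype=float)
    y = np.asarray(scores_y, dtype=float)
    # Pairwise comparison of runs within each task:
    # resulting shape is (runs_x, runs_y, num_tasks).
    greater = x[:, None, :] > y[None, :, :]
    ties = x[:, None, :] == y[None, :, :]
    per_task = (greater + 0.5 * ties).mean(axis=(0, 1))
    return per_task.mean()
```

A value near 0.5 from `probability_of_improvement` means neither algorithm is more likely to win on a randomly chosen task. Plotting `performance_profile` for two algorithms over a shared grid of thresholds shows where one curve sits above the other, rather than reducing the comparison to a single number.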
This review was created by AI and reviewed by human editors.