QUICK REVIEW

[Paper Review] Deep Reinforcement Learning that Matters

Peter Henderson, Riashat Islam|arXiv (Cornell University)|Sep 19, 2017

Evolutionary Algorithms and Applications | Cited by 364

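One-Line Summary

This paper examines reproducibility, experimental practice, and reporting in deep reinforcement learning, focusing on policy gradient methods, and proposes guidelines to improve rigor and comparability.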

ABSTRACT

In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.

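Research Motivation and Objectives

Assess the sources of variability that affect the reproducibility of deep RL experiments.
Evaluate how hyperparameters, network architectures, random seeds, and environments affect results.
Evaluate how different codebases and implementation details affect baseline performance.
Propose guidelines and statistical methods that improve reproducibility and fair comparison.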

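Proposed Method

Review and empirically analyze the factors that affect reproducibility in policy gradient methods for continuous control.
Run controlled experiments that vary hyperparameters, network architecture, reward scaling, random seeds, and environments.
Compare multiple baseline implementations on MuJoCo tasks (for example, TRPO, PPO, DDPG, and ACKTR as implemented in codebases such as OpenAI Baselines).
Report the mean and standard error over multiple seeds, and discuss significance tests and bootstrap methods.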
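
The reporting step in the last item can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the per-seed returns are hypothetical, and the percentile bootstrap is one common variant chosen here as an assumption.

```python
import math
import random
import statistics


def mean_and_standard_error(returns):
    """Return (mean, standard error) of per-seed final returns."""
    mean = statistics.fmean(returns)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(returns) / math.sqrt(len(returns))
    return mean, sem


def bootstrap_mean_ci(returns, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean return."""
    rng = random.Random(seed)  # fixed seed so the interval is repeatable
    n = len(returns)
    # Resample the runs with replacement and record each resample's mean.
    resampled_means = sorted(
        statistics.fmean(rng.choices(returns, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = resampled_means[int(tail * n_resamples)]
    upper = resampled_means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical final average returns, one per random seed (not from the paper).
returns_per_seed = [3120.0, 2875.0, 3410.0, 1990.0, 3305.0]

mean, sem = mean_and_standard_error(returns_per_seed)
low, high = bootstrap_mean_ci(returns_per_seed)
print(f"mean = {mean:.1f}, standard error = {sem:.1f}")
print(f"95% bootstrap CI for the mean: [{low:.1f}, {high:.1f}]")
```

With only a handful of seeds the bootstrap interval is itself rough, which is one reason to report the number of runs alongside it.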

Figure 1: Growth of published reinforcement learning papers. Shown are the number of RL-related publications (y-axis) per year (x-axis) scraped from Google Scholar searches.

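Experimental Results

Research Questions

RQ1: How do hyperparameters affect baseline performance across algorithms and environments?
RQ2: How do the choices of network architecture and activation function affect learning outcomes?
RQ3: How do random seeds, the number of trials, and environment stochasticity affect reported results?
RQ4: To what extent do different codebases change baseline performance?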
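
One way to organize runs that address these questions is to change a single factor at a time from a fixed baseline and repeat each setting over several seeds. The sketch below assumes that design; `RunConfig`, the baseline, and every sweep value are illustrative placeholders rather than settings taken from the paper.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    algorithm: str
    environment: str
    hidden_sizes: tuple
    activation: str
    reward_scale: float
    seed: int


# Placeholder baseline and sweep values, chosen for illustration only.
BASELINE = RunConfig("PPO", "HalfCheetah", (64, 64), "tanh", 1.0, 0)
SWEEPS = {
    "hidden_sizes": [(64, 64), (100, 50, 25), (400, 300)],
    "activation": ["tanh", "relu"],
    "reward_scale": [0.1, 1.0, 10.0],
}
SEEDS = [0, 1, 2, 3, 4]


def one_factor_at_a_time(baseline, sweeps, seeds):
    """Yield (factor, config) pairs that change one factor from the baseline."""
    for factor, values in sweeps.items():
        for value in values:
            for seed in seeds:
                # Every other field keeps its baseline value.
                yield factor, replace(baseline, **{factor: value, "seed": seed})


runs = list(one_factor_at_a_time(BASELINE, SWEEPS, SEEDS))
print(len(runs), "runs planned")
```

Keeping all other fields at the baseline makes it possible to attribute a change in return to the factor that was varied, at the cost of not covering interactions between factors.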

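Key Findings

Hyperparameters can have large and inconsistent effects across algorithms and environments.
Network architecture and activation functions strongly affect performance and interact with the chosen algorithm.
Random seeds and the number of trials can cause large performance variation, so averaging over seeds without a proper statistical framework can be misleading.
Environment characteristics (stable vs. unstable dynamics) strongly influence algorithm performance and can change which method performs best.
Implementation details that differ between codebases can produce substantial performance differences, which underscores the need to report all details and share code.
Significance tests and bootstrap analysis provide meaningful insight into whether an observed improvement is reliable.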
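
To make the point about seeds concrete, the sketch below checks whether two groups of runs differ by more than a random regrouping would explain. It uses an exact permutation test, which is a stand-in chosen here because it needs only the standard library, not necessarily the test used in the paper. The function name and the returns are hypothetical.

```python
import itertools
import statistics


def permutation_test_mean_difference(group_a, group_b):
    """Exact two-sided permutation test for a difference in mean return.

    Enumerates every way of splitting the pooled runs into groups of the
    original sizes and returns the fraction of splits whose absolute mean
    difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    total = sum(pooled)

    at_least_as_extreme = 0
    n_splits = 0
    for indices in itertools.combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        n_splits += 1
        # Small tolerance so the observed split itself always counts.
        if difference >= observed - 1e-9:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_splits


# Hypothetical returns: one algorithm, one configuration, ten seeds
# arbitrarily divided into two groups of five (not data from the paper).
seeds_group_a = [3100.0, 2950.0, 3400.0, 2100.0, 3250.0]
seeds_group_b = [1500.0, 2800.0, 1200.0, 3000.0, 1700.0]

p_value = permutation_test_mean_difference(seeds_group_a, seeds_group_b)
print(f"mean A = {statistics.fmean(seeds_group_a):.1f}")
print(f"mean B = {statistics.fmean(seeds_group_b):.1f}")
print(f"permutation p-value = {p_value:.3f}")
```

A large gap between the two group means can look like an algorithmic improvement even when both groups come from the same configuration, so a test of this kind is a useful check before claiming a gain.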

Figure 2: Significance of Policy Network Structure and Activation Functions: PPO (left), TRPO (middle), and DDPG (right).


This review was generated by AI and checked by a human editor.