QUICK REVIEW

[Paper Review] On the Importance of Hyperparameter Optimization for Model-based Reinforcement Learning

Baohe Zhang, Raghu Rajan|arXiv (Cornell University)|Feb 26, 2021

Machine Learning and Data Classification|References 21|Citations 32

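Summary in brief

Automatic hyperparameter optimization (HPO) can outperform human experts in model-based RL, and dynamic HPO yields further performance gains. The results highlight the effects of non-stationarity and of environment-dependent hyperparameters.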

ABSTRACT

Model-based Reinforcement Learning (MBRL) is a promising framework for learning control in a data-efficient manner. MBRL algorithms can be fairly complex due to the separate dynamics modeling and the subsequent planning algorithm, and as a result, they often possess tens of hyperparameters and architectural choices. For this reason, MBRL typically requires significant human expertise before it can be applied to new problems and domains. To alleviate this problem, we propose to use automatic hyperparameter optimization (HPO). We demonstrate that this problem can be tackled effectively with automated HPO, which we demonstrate to yield significantly improved performance compared to human experts. In addition, we show that tuning of several MBRL hyperparameters dynamically, i.e. during the training itself, further improves the performance compared to using static hyperparameters which are kept fixed for the whole training. Finally, our experiments provide valuable insights into the effects of several hyperparameters, such as plan horizon or learning rate and their influence on the stability of training and resulting rewards.

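Motivation and objectives

Motivates the need for automatic HPO in model-based RL, given its high-dimensional hyperparameter space and its data-efficiency concerns.
Evaluates whether HPO can outperform human-tuned settings for PETS.
Investigates static versus dynamic (time-varying) hyperparameter configurations and their effect on training stability and final reward.
Provides practical guidance on when dynamic tuning is beneficial and on how history copying affects PBT for MBRL.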

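Proposed method

Considers three HPO approaches, including Hyperband (multi-fidelity).
Also uses Population Based Training (PBT), with and without backtracking, and random search as a baseline.
For PETS, considers two hyperparameter groups (model training and the CEM optimizer) as well as their joint optimization.
Evaluates on four MuJoCo tasks (Pusher, Reacher, Hopper, HalfCheetah) and on a simulation of the Daisy hexapod.
To handle stochastic evaluations, uses the average return of the most recent trials as the HPO objective.
Investigates how well learned dynamic schedules transfer across environments and runs, and analyzes the effect of history copying in PBT.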
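
The PBT procedure and the recent-trials objective described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Member` class, the `perturb` rule, the window of 3 trials, and the 25% replacement fraction are all hypothetical choices made here for clarity, and a real run would also copy the dynamics-model weights.

```python
import copy
import random
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Member:
    """One PBT population member (hypothetical structure for illustration)."""
    hparams: dict                                  # e.g. {"plan_horizon": 30, "lr": 1e-3}
    history: list = field(default_factory=list)    # stand-in for collected transitions
    returns: list = field(default_factory=list)    # one return per finished trial

    def score(self, window=3):
        """HPO objective: mean return over the most recent `window` trials."""
        recent = self.returns[-window:]
        return mean(recent) if recent else float("-inf")


def perturb(hparams, rng, factor=1.2):
    """Explore step: scale each hyperparameter up or down by `factor`."""
    new = {}
    for name, value in hparams.items():
        scaled = value * (factor if rng.random() < 0.5 else 1.0 / factor)
        # Keep integer-valued hyperparameters (e.g. plan horizon) integral and >= 1.
        new[name] = max(1, round(scaled)) if isinstance(value, int) else scaled
    return new


def exploit_and_explore(population, rng, frac=0.25, copy_history=True):
    """Replace the worst `frac` of members with perturbed copies of the best."""
    ranked = sorted(population, key=lambda m: m.score(), reverse=True)
    n = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:n], ranked[-n:]
    for loser in bottom:
        winner = rng.choice(top)
        loser.hparams = perturb(winner.hparams, rng)
        loser.returns = list(winner.returns)
        if copy_history:
            # History copying: the loser also inherits the winner's data.
            loser.history = copy.deepcopy(winner.history)
    return population
```

Calling `exploit_and_explore` periodically during training is what makes the resulting hyperparameter schedule dynamic. Setting `copy_history=False` gives the ablation in which a replaced member keeps its own data while adopting the winner's hyperparameters.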

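Experimental results

Research questions

RQ1: Does automatic HPO outperform manually tuned hyperparameters for PETS-based MBRL across multiple environments?
RQ2: Do dynamic hyperparameter configurations yield gains over static configurations in MBRL?
RQ3: How well do dynamic HPO schedules transfer across environments and runs?
RQ4: What is the effect of history copying in PBT for MBRL?
RQ5: Which hyperparameter group (model training or the CEM optimizer) accounts for most of the performance gain?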

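Key findings

Automatic HPO configurations outperform the manually tuned baselines on Hopper and HalfCheetah, improving final return by up to 10x.
Dynamic HPO schedules outperform static tuning under an equal budget, suggesting non-stationarity in MBRL.
Hyperparameter importance varies with the environment and with whether the model-training or the CEM optimizer parameters are tuned.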
PBT with history copying across members significantly improves performance and stability, whereas omitting history copying harms performance.
The best static configurations for some environments use mid-range plan horizons, whereas dynamic methods adapt the horizon over time.
Tuning the model-training hyperparameters generally has a larger impact than tuning the CEM hyperparameters in HalfCheetah, although the difference is environment-dependent.
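
To make the CEM hyperparameters in these findings concrete, the sketch below shows where the plan horizon, population size, and elite count enter a cross-entropy-method planner. It is a toy illustration only: `toy_dynamics` and `toy_reward` are hypothetical 1-D stand-ins for a learned model, all default values are assumptions, and this is not the PETS code.

```python
import random
from statistics import mean, pstdev


def toy_dynamics(state, action):
    """Hypothetical 1-D stand-in for a learned dynamics model."""
    return state + action


def toy_reward(state, action):
    """Reward is highest when the state sits at the origin with small actions."""
    return -(state ** 2) - 0.01 * (action ** 2)


def rollout_return(state, actions):
    """Sum of rewards along one imagined trajectory."""
    total = 0.0
    for a in actions:
        state = toy_dynamics(state, a)
        total += toy_reward(state, a)
    return total


def cem_plan(state, plan_horizon=5, pop_size=64, num_elites=8, iters=5, seed=0):
    """Return the first action of the final CEM mean action sequence."""
    rng = random.Random(seed)
    mu = [0.0] * plan_horizon
    sigma = [1.0] * plan_horizon
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        candidates = [[rng.gauss(m, s) for m, s in zip(mu, sigma)]
                      for _ in range(pop_size)]
        # Keep the sequences with the highest imagined return.
        candidates.sort(key=lambda seq: rollout_return(state, seq), reverse=True)
        elites = candidates[:num_elites]
        # Refit the sampling distribution to the elites, one time step at a time.
        mu = [mean(step) for step in zip(*elites)]
        sigma = [pstdev(step) + 1e-6 for step in zip(*elites)]
    return mu[0]
```

A static configuration would fix `plan_horizon` for the whole run, while a dynamic method would pass a different value to `cem_plan` as training progresses.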


This review was generated by AI and checked by a human editor.