QUICK REVIEW

[Paper Review] CEM-RL: Combining evolutionary and gradient-based methods for policy search

Aloïs Pourchot, Olivier Sigaud|arXiv (Cornell University)|Oct 2, 2018

Reinforcement Learning in Robotics|References: 30|Citations: 95

One-line summary

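CEM-RL integrates the Cross-Entropy Method (CEM) with TD3, using evolutionary search and gradient-based policy improvement together to reach competitive or better performance and stability across continuous control benchmarks.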

ABSTRACT

Deep neuroevolution and deep reinforcement learning (deep RL) algorithms are two popular approaches to policy search. The former is widely applicable and rather stable, but suffers from low sample efficiency. By contrast, the latter is more sample efficient, but the most sample efficient variants are also rather unstable and highly sensitive to hyper-parameter setting. So far, these families of methods have mostly been compared as competing tools. However, an emerging approach consists in combining them so as to get the best of both worlds. Two previously existing combinations use either an ad hoc evolutionary algorithm or a goal exploration process together with the Deep Deterministic Policy Gradient (DDPG) algorithm, a sample efficient off-policy deep RL algorithm. In this paper, we propose a different combination scheme using the simple cross-entropy method (CEM) and Twin Delayed Deep Deterministic policy gradient (TD3), another off-policy deep RL algorithm which improves over DDPG. We evaluate the resulting method, CEM-RL, on a set of benchmarks classically used in deep RL. We show that CEM-RL benefits from several advantages over its competitors and offers a satisfactory trade-off between performance and sample efficiency.

Motivation and objectives

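Combine evolutionary methods and deep reinforcement learning for policy search, balancing the stability of exploration against sample efficiency.
Propose a concrete method (CEM-RL) that couples the cross-entropy method with TD3-based, critic-driven gradient updates.
Evaluate CEM-RL against baselines (CEM, TD3, multi-actor TD3) and an existing hybrid (ERL) on standard MuJoCo benchmarks.
Analyze how the evolutionary component and the gradient-based improvement each affect performance and stability.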

Proposed method

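A population of policies is sampled from a Gaussian distribution centered on the current mean policy, with covariance Sigma.
Half of the population is evaluated directly; the other half is first improved by gradient steps guided by the TD3 critic and then evaluated.
The top-performing half is used to update the population mean and covariance (the CEM update).
Experience from all actors goes into a shared replay buffer, and the critic is trained on the new experience; gradient steps are applied to actors drawn from the population.
Importance mixing may be applied when sampling, and the allocation of resources between environment steps and learning updates is discussed explicitly.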
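
The loop described above can be sketched on a toy problem. This is a minimal illustration under loud assumptions, not the authors' implementation: a policy is a flat parameter vector, `toy_return` is a hypothetical stand-in for an episode return, and `toy_gradient_step` replaces the critic-guided TD3 actor update with plain gradient ascent on that toy objective. The covariance is kept diagonal, the elite update is an unweighted mean and variance with a small additive `noise` term, and the replay buffer, critic training, and importance mixing are omitted. All function and parameter names are invented for this sketch.

```python
import numpy as np


def toy_return(theta):
    # Hypothetical stand-in for an episode return: higher is better,
    # with the maximum (0.0) reached at theta == 1 in every dimension.
    return -float(np.sum((theta - 1.0) ** 2))


def toy_gradient_step(theta, lr=0.1):
    # Assumption: replaces the critic-guided actor update from the review.
    # It ascends the analytic gradient of toy_return instead of a learned critic.
    return theta + lr * (-2.0 * (theta - 1.0))


def cem_rl_iteration(mean, var, rng, pop_size=10, noise=1e-3):
    # 1. Sample a population around the current mean (diagonal covariance).
    population = mean + np.sqrt(var) * rng.standard_normal((pop_size, mean.size))

    # 2. Half of the population receives a gradient step before evaluation;
    #    the other half is evaluated as sampled.
    half = pop_size // 2
    population[:half] = toy_gradient_step(population[:half])

    # 3. Evaluate every individual.
    scores = np.array([toy_return(p) for p in population])

    # 4. CEM update: refit mean and variance on the top-performing half.
    elite = population[np.argsort(scores)[-half:]]
    new_mean = elite.mean(axis=0)
    new_var = elite.var(axis=0) + noise  # small floor so sampling never collapses
    return new_mean, new_var, float(scores.max())


def run(iterations=30, dim=5, seed=0):
    rng = np.random.default_rng(seed)
    mean, var = np.zeros(dim), np.ones(dim)
    best = toy_return(mean)
    for _ in range(iterations):
        mean, var, best = cem_rl_iteration(mean, var, rng)
    return mean, var, best


if __name__ == "__main__":
    mean, var, best = run()
    print("final mean:", mean)
    print("best return in last generation:", best)
```

In this sketch the gradient-improved individuals compete with the purely sampled ones in the same ranking, so whether the elite set is dominated by one group or the other depends on how useful the gradient steps are at that point in the search.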

Experimental results

Research questions

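RQ1: Does CEM-RL outperform its CEM and TD3 components, as well as a multi-actor variant of TD3, on standard continuous control benchmarks?
RQ2: How does CEM-RL compare with ERL in final performance, convergence speed, and learning stability?
RQ3: Does the combination improve sample efficiency in practice and make learning more robust to hyperparameters?
RQ4: To what extent does the evolutionary component contribute beyond simply providing population-based exploration?
RQ5: Which limiting factors or environment characteristics may cause CEM-RL to underperform?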

Key findings

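CEM-TD3 tends to outperform CEM, TD3, and multi-actor TD3 on several MuJoCo benchmarks, while reducing the variance of learning.
The CEM-RL methods (CEM-DDPG and CEM-TD3) outperform ERL on several environments, with CEM-TD3 tending to reach the best final performance and to converge faster.
An ablation that replaces the evolutionary component with multi-actor TD3, in which several actors share the same TD3 critic guidance, shows degraded performance, suggesting that the evolution-gradient combination itself is beneficial.
Compared with ERL, CEM-TD3 often shows higher stability and final performance, particularly on harder environments such as Walker2d-v2 and Ant-v2.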

This review was generated by AI and checked by a human editor.