QUICK REVIEW

[Paper Review] Diagnosing Non-Intermittent Anomalies in Reinforcement Learning Policy Executions (Short Paper)

Natan, Avraham, Stern, Roni|arXiv (Cornell University)|Jul 20, 2017

Reinforcement Learning in Robotics | References: 11 | Citations: 11,253

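Summary in Brief

This paper introduces Proximal Policy Optimization (PPO), a stable and sample-efficient reinforcement learning algorithm that avoids the complexity of trust-region constraints. PPO uses a clipped surrogate objective to limit the size of each policy update, which prevents the performance degradation that large updates can cause and yields reliable performance. PPO achieves state-of-the-art results on continuous-control benchmarks and arcade games, outperforming A2C and matching ACER while being far less complex.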

ABSTRACT

Due to the safety risks and training sample inefficiency, it is often preferred to develop controllers in simulation. However, minor differences between the simulation and the real world can cause a significant sim-to-real gap. This gap can reduce the effectiveness of the developed controller. In this paper, we examine a case study of transferring an octorotor reinforcement learning controller from simulation to the real world. First, we quantify the effectiveness of the real-world transfer by examining safety metrics. We find that although there is a noticeable (around 100%) increase in deviation in real flights, this deviation may not be considered unsafe, as it will be within > 2m safety corridors. Then, we estimate the densities of the measurement distributions and compare the Jensen-Shannon divergences of simulated and real measurements. From this, we show that the vehicle’s orientation is significantly different between simulated and real flights. We attribute this to a different flight mode in real flights where the vehicle turns to face the next waypoint. We also find that the reinforcement learning controller actions appear to correctly counteract disturbance forces. Then, we analyze the errors of a measurement autoencoder and state transition model neural network applied to real data. We find that these models further reinforce the difference between the simulated and real attitude control, showing the errors directly on the flight paths. Finally, we discuss important lessons learned in the sim-to-real transfer of our controller.
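
The abstract's comparison of simulated and real measurement distributions via Jensen-Shannon divergence can be illustrated with a minimal sketch. It assumes a simple shared-bin histogram as the density estimate (the abstract does not say which estimator the authors used), and the function name `js_divergence`, the `bins` parameter, and the synthetic yaw samples are all hypothetical.

```python
import numpy as np


def js_divergence(sim, real, bins=50):
    """Jensen-Shannon divergence (log base 2, so in [0, 1]) of two 1-D samples.

    Assumption: densities are estimated with histograms on shared bin edges.
    """
    sim = np.asarray(sim, dtype=float)
    real = np.asarray(real, dtype=float)

    # Shared bin edges so both histograms are defined on the same support.
    lo = min(sim.min(), real.min())
    hi = max(sim.max(), real.max())
    edges = np.linspace(lo, hi, bins + 1)

    p, _ = np.histogram(sim, bins=edges)
    q, _ = np.histogram(real, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical data: yaw angle in simulation vs. real flight.
rng = np.random.default_rng(0)
sim_yaw = rng.normal(0.0, 0.1, size=5000)
real_yaw = rng.normal(0.8, 0.3, size=5000)
print("JS divergence (yaw):", js_divergence(sim_yaw, real_yaw))
```

A value near 0 means the two samples are distributed almost identically, and a value near 1 means they barely overlap. Computing this per measurement channel would flag which channels, such as orientation, differ most between simulation and real flights.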

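Motivation and Objectives

Develop a reinforcement learning algorithm that combines the sample efficiency and stability of trust-region methods with the simplicity and scalability of standard policy gradient methods.
Address the limitations of prior methods: the poor sample efficiency of vanilla policy gradient methods, and the high complexity of TRPO together with its incompatibility with modern deep learning architectures that use dropout or parameter sharing.
Design a first-order optimization method that allows multiple gradient updates on a single batch of data while preventing destructive policy changes.
Evaluate the method on diverse benchmarks, including MuJoCo continuous-control tasks and arcade games, and demonstrate favorable sample complexity and robustness.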

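Proposed Method

Proposes the clipped surrogate objective L^CLIP(θ) = E_t[min(r_t(θ) A_t, clip(r_t(θ), 1-ϵ, 1+ϵ) A_t)], which serves as a pessimistic lower bound on policy improvement.
Uses the probability ratio r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) to measure the size of a policy update, and clips it to prevent large updates.
Runs minibatch stochastic gradient ascent for multiple epochs over the same batch of data, improving sample efficiency.
Adopts a conservative update strategy: taking the minimum with the clipped term caps the objective, so the policy gains nothing from moving the ratio far outside [1-ϵ, 1+ϵ].
Uses a simple first-order optimization scheme (e.g., Adam) that requires neither conjugate gradients nor Hessian approximations.
Can be implemented with minimal code changes to a standard policy gradient framework, which makes it highly practical.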
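
The clipped objective above can be made concrete with a minimal NumPy sketch. It only evaluates the objective for a given batch; a real implementation would compute it inside an autodiff framework and maximize it with an optimizer such as Adam. The function name `clipped_surrogate` and its argument names are hypothetical, and advantages A_t are assumed to be precomputed.

```python
import numpy as np


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Evaluate the clipped surrogate objective on one batch.

    logp_new / logp_old: log pi_theta(a_t|s_t) under the new and old policy.
    advantages: precomputed advantage estimates A_t (assumed given).
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Probability ratio r_t(theta) = pi_theta / pi_theta_old.
    ratio = np.exp(logp_new - logp_old)

    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages

    # Elementwise minimum gives the pessimistic bound; average over the batch.
    return float(np.mean(np.minimum(unclipped, clipped)))


# Ratio 1.5 with positive advantage: the gain is capped at (1 + eps) * A_t.
print(clipped_surrogate([np.log(1.5)], [0.0], [1.0]))
# Ratio 1.5 with negative advantage: the penalty is NOT capped.
print(clipped_surrogate([np.log(1.5)], [0.0], [-1.0]))
```

The two calls show the asymmetry that makes the objective conservative: pushing the ratio beyond 1+ϵ earns no extra credit when the advantage is positive, but the full penalty still applies when the advantage is negative.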

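Experimental Results

Research Questions

RQ1: Can a simple first-order policy optimization method match the sample efficiency and stability of TRPO while avoiding its complexity?
RQ2: Can the clipped surrogate objective enable multiple optimization passes over the sampled data while effectively preventing destructive policy updates?
RQ3: On continuous-control and arcade tasks, does PPO outperform A2C, ACER, and TRPO in sample efficiency and final performance?
RQ4: Can the method generalize to diverse environments without extensive hyperparameter tuning?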

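Key Findings

PPO with ϵ = 0.2 achieved an average normalized score of 0.82 on the continuous-control benchmarks, the best result among all settings and methods tested.
In the MuJoCo environments, PPO outperformed A2C, A2C with trust region, vanilla policy gradient with adaptive step size, and a tuned implementation of TRPO on almost all tasks.
On the arcade benchmark, PPO won 30 of 49 games when scored by average reward over the entire training period, far ahead of A2C (1 win).
When scored over the final 100 episodes, PPO won 19 games, ahead of A2C (1 win), a result the review describes as comparable to ACER (28 wins) and as evidence of strong late-stage performance.
The clipped objective with ϵ = 0.2 performed best, while the adaptive KL penalty and fixed-β penalty variants performed worse.
PPO also performed well on complex 3D humanoid tasks involving forward locomotion, target relocation, and obstacle avoidance, demonstrating scalability to high-dimensional control.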

This review was generated by AI and checked by a human editor.