QUICK REVIEW

[Paper Review] Learning to Walk via Deep Reinforcement Learning

Tuomas Haarnoja, Sehoon Ha|arXiv (Cornell University)|Dec 26, 2018

Robotic Locomotion and Control | Citations: 42

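One-Sentence Summary

This paper presents a sample-efficient, entropy-regularized deep reinforcement learning method that learns quadrupedal walking directly on real hardware with minimal hyperparameter tuning. The method is demonstrated on a Minitaur robot and validated in simulation.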

ABSTRACT

Deep reinforcement learning (deep RL) holds the promise of automating the acquisition of complex controllers that can map sensory inputs directly to low-level actions. In the domain of robotic locomotion, deep RL could enable learning locomotion skills with minimal engineering and without an explicit model of the robot dynamics. Unfortunately, applying deep RL to real-world robotic tasks is exceptionally difficult, primarily due to poor sample complexity and sensitivity to hyperparameters. While hyperparameters can be easily tuned in simulated domains, tuning may be prohibitively expensive on physical systems, such as legged robots, that can be damaged through extensive trial-and-error learning. In this paper, we propose a sample-efficient deep RL algorithm based on maximum entropy RL that requires minimal per-task tuning and only a modest number of trials to learn neural network policies. We apply this method to learning walking gaits on a real-world Minitaur robot. Our method can acquire a stable gait from scratch directly in the real world in about two hours, without relying on any model or simulation, and the resulting policy is robust to moderate variations in the environment. We further show that our algorithm achieves state-of-the-art performance on simulated benchmarks with a single set of hyperparameters. Videos of training and the learned policy can be found on the project website.

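Motivation and Objectives

Enable end-to-end learning of locomotion without an explicit dynamics model or hand-designed gaits.
Develop a sample-efficient RL algorithm for real-world robots that is robust to hyperparameter choices.
Adjust the entropy (temperature) automatically to reduce per-task hyperparameter tuning.
Demonstrate learning of a stable locomotion gait directly on a physical quadruped and evaluate its robustness.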

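Proposed Method

Extends maximum entropy RL with an entropy-constrained objective so that the temperature parameter does not have to be tuned by hand.
Adjusts the temperature automatically with dual gradient updates so that the policy meets a target entropy.
Adopts a soft actor-critic (SAC) architecture with two Q-functions and a stochastic Gaussian policy.
Trains asynchronously on the real robot, with separate pipelines for data collection, motion-capture-based reward computation, and training.
Evaluates on OpenAI Gym benchmarks and on the Minitaur robot, both on hardware and in simulation.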
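
The dual gradient update on the temperature can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the commonly used temperature loss J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)], parameterizes alpha through its logarithm to keep it positive, and uses a fixed one-dimensional Gaussian policy in place of a neural network. The function names, the learning rate, and the target entropy of -1 are all assumptions made for the example.

```python
import math
import random


def gaussian_log_prob(action, mean, std):
    """Log-density of a 1-D Gaussian policy at the sampled action."""
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def temperature_step(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on log(alpha) for the assumed loss
    J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)].

    By the chain rule, dJ/dlog(alpha) = -alpha * E[log pi + target_entropy],
    so alpha grows when the policy entropy is below the target and shrinks
    when it is above.
    """
    alpha = math.exp(log_alpha)
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -alpha * mean_term
    return log_alpha - lr * grad


def sample_log_probs(std, n=2000, seed=0):
    """Sample actions from a zero-mean Gaussian policy; return their log-probs."""
    rng = random.Random(seed)
    return [gaussian_log_prob(rng.gauss(0.0, std), 0.0, std) for _ in range(n)]


if __name__ == "__main__":
    target = -1.0  # assumed target entropy for a 1-D action space
    for std in (0.01, 0.1):
        new_log_alpha = temperature_step(0.0, sample_log_probs(std), target)
        direction = "raised" if new_log_alpha > 0.0 else "lowered"
        print(f"policy std={std}: temperature {direction}")
```

In a full actor-critic loop this step would be interleaved with the Q-function and policy updates, using log-probabilities of actions sampled from the current policy on each minibatch.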

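Experimental Results

Research Questions

RQ1: Can entropy-constrained maximum entropy RL learn locomotion directly on a real robot with minimal hyperparameter tuning?
RQ2: Does the learned policy generalize to unseen terrain and real-world perturbations?
RQ3: How does the method perform against baselines on simulated benchmarks, and how do fixed and adaptive temperatures compare?
RQ4: What data-efficiency and robustness benefits does the proposed entropy-tuning mechanism provide?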

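Key Findings

The method achieves stable real-world walking on the Minitaur in about two hours (roughly 400 rollouts).
Across the OpenAI Gym benchmarks, it matches or exceeds the performance of fixed-temperature SAC while using a single set of hyperparameters.
Automatic entropy tuning reduces sensitivity to the reward scale and to the target entropy, improving robustness across tasks.
In simulation, it shows state-of-the-art data efficiency and robustness, including tolerance of lateral perturbations of up to 220 N.
The gait learned on the Minitaur is periodic and synchronized, matches the speed of the robot's default gait while following different joint trajectories, and generalizes to unseen obstacles and terrain (training took place on flat terrain only).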
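
The reward-scale finding can be illustrated with a toy example. The sketch below is not the paper's setup: it assumes a three-action bandit, where the maximum entropy policy is a softmax of Q/alpha, and it uses bisection rather than dual gradient descent to find the temperature that meets a target entropy. All names and numbers are hypothetical. With a fixed alpha, multiplying the rewards by 100 collapses the policy's entropy. When alpha is chosen to meet the entropy target, it rescales with the rewards and the policy stays the same.

```python
import math


def softmax_policy(q_values, alpha):
    """Maximum entropy policy for a bandit: pi(a) proportional to exp(Q(a)/alpha)."""
    q_max = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - q_max) / alpha) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def solve_alpha(q_values, target_entropy, lo=1e-4, hi=1e4, iters=200):
    """Find alpha whose softmax policy has the target entropy.

    Entropy increases monotonically with alpha, so bisection in log space
    works as long as the target is reachable within [lo, hi].
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if entropy(softmax_policy(q_values, mid)) < target_entropy:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


if __name__ == "__main__":
    q = [1.0, 0.5, 0.0]
    q_scaled = [100.0 * v for v in q]  # same task, rewards scaled by 100
    target = 0.8

    print("fixed alpha=1, entropy:",
          entropy(softmax_policy(q, 1.0)),
          entropy(softmax_policy(q_scaled, 1.0)))

    a1 = solve_alpha(q, target)
    a2 = solve_alpha(q_scaled, target)
    print("tuned alphas:", a1, a2)
    print("tuned policies:", softmax_policy(q, a1), softmax_policy(q_scaled, a2))
```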

This review was generated by AI and checked by a human editor.