QUICK REVIEW

[Paper Review] Lyapunov-based Safe Policy Optimization for Continuous Control

Yinlam Chow, Ofir Nachum|arXiv (Cornell University)|Jan 28, 2019

Reinforcement Learning in Robotics | References: 30 | Citations: 152

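One-Line Summary

This paper introduces Lyapunov-based Safe Policy Optimization, which enforces the safety constraints of a CMDP in continuous control, and presents two solution methods: theta-projection and a-projection. Both integrate with standard policy gradient methods (DDPG, PPO), keep the policy safe during training and at convergence by guaranteeing near-constraint satisfaction at every policy update, and are data efficient because they can use both on-policy and off-policy data.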

ABSTRACT

We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations. We formulate these problems as constrained Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while guaranteeing near-constraint satisfaction for every policy update by projecting either the policy parameter or the action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world indoor robot navigation problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction. Videos of the experiments can be found in the following link: https://drive.google.com/file/d/1pzuzFqWIE710bE2U6DmS59AfRzqK2Kek/view?usp=sharing.

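Motivation and Objectives

Motivate safety-critical reinforcement learning in continuous control, formulated as a constrained Markov decision process (CMDP).
Develop Lyapunov-based policy optimization methods that guarantee near-constraint satisfaction at every policy update.
Establish compatibility with standard policy gradient methods (DDPG, PPO) and use both on-policy and off-policy data for efficiency.
Provide two implementable approaches (theta-projection and a-projection) that handle infinite/continuous action spaces under Lyapunov constraints.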

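Proposed Method

Formulate safe CMDP optimization using state-dependent Lyapunov constraints that upper-bound the cumulative constraint cost.
Introduce two solution methods: (i) theta-projection optimizes the policy parameters by projecting them under the Lyapunov constraints; (ii) a-projection embeds the Lyapunov constraints as a safety layer that projects actions onto the feasible set.
Use a Taylor-series-based surrogate that converts the infinitely many Lyapunov constraints into a tractable, differentiable form suitable for gradient-based updates.
Build on on-policy (PPO) and off-policy (DDPG) algorithms to improve data efficiency and enable end-to-end training.
Connect the approach to existing safe RL methods (CPO, Lagrangian) and show that the Lyapunov constraints can be integrated into training by backpropagation.
Demonstrate safe training and improved constraint satisfaction on MuJoCo benchmarks and a real-world robot navigation task.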
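
The a-projection idea above can be illustrated with a small sketch. It assumes the linearized Lyapunov constraint at a single state reduces to one half-space constraint on the action, g · (a - a_baseline) <= eps, in which case the Euclidean projection has a closed form. The function name `project_action` and the symbols `g`, `a_baseline`, and `eps` are hypothetical names chosen for this illustration; this is not the authors' implementation, and the paper's exact constraint may differ.

```python
import numpy as np


def project_action(a_policy, a_baseline, g, eps):
    """Project a_policy onto the half-space {a : g . (a - a_baseline) <= eps}.

    Assumed (hypothetical) inputs:
      a_policy   -- action proposed by the unconstrained policy
      a_baseline -- action of a baseline policy assumed to be safe
      g          -- gradient of the constraint surrogate w.r.t. the action
      eps        -- allowed slack for the linearized constraint
    """
    a_policy = np.asarray(a_policy, dtype=float)
    a_baseline = np.asarray(a_baseline, dtype=float)
    g = np.asarray(g, dtype=float)

    gg = float(g @ g)
    if gg == 0.0:
        # The linearized constraint does not depend on the action.
        return a_policy.copy()

    # Amount by which the proposed action violates the constraint.
    violation = float(g @ (a_policy - a_baseline)) - eps
    # Multiplier is zero when the action is already feasible.
    lam = max(0.0, violation / gg)
    # Move along -g just far enough to reach the constraint boundary.
    return a_policy - lam * g


if __name__ == "__main__":
    safe = project_action([1.0, 0.5], [0.0, 0.0], [1.0, 0.0], eps=0.2)
    print(safe)
```

Because the projection is a closed-form expression in the policy output, it is differentiable almost everywhere, which is one way a safety layer of this kind could sit inside a network trained by backpropagation. A feasible action is returned unchanged, so the layer only intervenes when the linearized constraint would be violated.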

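Experimental Results

Research Questions

RQ1: How can a CMDP be solved in a continuous action space while guaranteeing safety at every policy update?
RQ2: Can Lyapunov-based constraints be integrated with standard PG methods (PPO, DDPG) to achieve safe, data-efficient learning?
RQ3: Do theta-projection and a-projection provide practical, scalable solutions that perform as well as or better than existing safe RL baselines such as CPO and Lagrangian methods?
RQ4: How well do the safety guarantees of the proposed methods transfer from simulation to real-world robotics tasks?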

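Key Findings

The Lyapunov-based PG algorithms achieve competitive performance while maintaining constraint satisfaction during training.
Compared with Lagrangian methods and CPO, the proposed methods are more data efficient because they can use both on-policy and off-policy data.
The a-projection safety layer often converges faster and yields less conservative updates than theta-projection, improving learning speed and stability.
On the MuJoCo tasks and a real Fetch robot, the methods balance performance and safety, with improved generalization to new environments and transfer to the real robot.
The framework can be implemented end to end and integrated with PPO or DDPG, enabling training by backpropagation without relying on line search or expensive backtracking.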

This review was generated by AI and checked by a human editor.