QUICK REVIEW

[Paper Review] Lyapunov-based Safe Policy Optimization for Continuous Control

Yinlam Chow, Ofir Nachum | arXiv (Cornell University) | 2019-01-28

Reinforcement Learning in Robotics · References: 30 · Citations: 152

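One-Line Summary

This paper presents Lyapunov-based safe policy optimization for constrained Markov decision processes (CMDPs) in continuous control. Its two approaches, theta-projection and a-projection, plug into standard policy gradient methods (DDPG, PPO), aim for near-constraint satisfaction at every policy update during training and at convergence, and are data efficient because they can use both on-policy and off-policy data.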

ABSTRACT

We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations. We formulate these problems as constrained Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while guaranteeing near-constraint satisfaction for every policy update by projecting either the policy parameter or the action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world indoor robot navigation problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction. Videos of the experiments can be found in the following link: https://drive.google.com/file/d/1pzuzFqWIE710bE2U6DmS59AfRzqK2Kek/view?usp=sharing.

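Motivation and Objectives

Motivates safety-aware reinforcement learning for continuous control, formulated as a constrained Markov decision process (CMDP).
Develops a Lyapunov-based policy optimization method that guarantees near-constraint satisfaction at every policy update.
Ensures compatibility with standard policy gradient methods (PPO, DDPG) and uses both on-policy and off-policy data for efficiency.
Provides two implementable approaches (theta-projection and a-projection) for handling infinite (continuous) action spaces and Lyapunov constraints.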
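
For reference, the CMDP problem mentioned above can be written in the following generic form. The notation is an assumption made for illustration and may differ from the paper's: $c$ is the per-step cost, $d$ the per-step constraint cost, $d_0$ the safety threshold, $x_0$ the initial state, and $T$ the (possibly random) episode length.

```latex
\[
\min_{\theta}\;
\mathbb{E}\!\left[\sum_{t=0}^{T-1} c(x_t, a_t) \,\middle|\, x_0, \pi_\theta\right]
\quad \text{subject to} \quad
\mathbb{E}\!\left[\sum_{t=0}^{T-1} d(x_t) \,\middle|\, x_0, \pi_\theta\right] \le d_0 .
\]
```

A policy is called safe in this sense when its expected cumulative constraint cost stays at or below $d_0$.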

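Proposed Method

Formulates safe CMDP optimization using state-dependent Lyapunov constraints that bound the cumulative constraint cost.
Introduces two solution schemes: (i) theta-projection optimizes the policy by projecting the policy parameters onto the feasible set defined by the Lyapunov constraints; (ii) a-projection embeds the Lyapunov constraints as a safety layer and projects actions onto the feasible set.
Uses a first-order Taylor-series surrogate to approximate the infinitely many Lyapunov constraints in a tractable, differentiable form.
Uses both on-policy (PPO) and off-policy (DDPG) algorithms to improve data efficiency and enable end-to-end training.
Relates the approach to existing safe RL methods such as CPO and Lagrangian methods, and shows how to integrate Lyapunov constraints into backpropagation-based training.
Demonstrates improved safe learning and constraint satisfaction on MuJoCo benchmarks and a real robot navigation task.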
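
The a-projection idea above can be illustrated with a closed-form projection onto a single linearized constraint. This is a minimal sketch and not the authors' implementation: it assumes the Lyapunov constraint has already been linearized around a baseline action `a_base` into the half-space `g · (a - a_base) <= eps`, and the names `project_action`, `a_pi`, `a_base`, `g`, and `eps` are hypothetical.

```python
import numpy as np


def project_action(a_pi, a_base, g, eps):
    """Project a_pi onto the half-space {a : g . (a - a_base) <= eps}.

    Solves min_a 0.5 * ||a - a_pi||^2 subject to one linear constraint,
    which has a closed-form solution. All argument names are hypothetical.
    """
    a_pi = np.asarray(a_pi, dtype=float)
    a_base = np.asarray(a_base, dtype=float)
    g = np.asarray(g, dtype=float)

    g_sq = float(g @ g)
    if g_sq < 1e-12:
        # Degenerate gradient: the linearized constraint does not depend
        # on the action, so there is nothing to project onto.
        return a_pi.copy()

    # Amount by which the proposed action violates the constraint.
    violation = float(g @ (a_pi - a_base)) - eps
    # Multiplier is zero when the action is already feasible.
    lam = max(0.0, violation / g_sq)
    return a_pi - lam * g


if __name__ == "__main__":
    g = np.array([1.0, 0.0])
    a_base = np.zeros(2)
    # Infeasible proposal: moved onto the constraint boundary.
    print(project_action([2.0, 1.0], a_base, g, eps=0.5))
    # Feasible proposal: returned unchanged.
    print(project_action([0.2, 1.0], a_base, g, eps=0.5))
```

Because this projection is differentiable in `a_pi` almost everywhere, a layer of this kind could sit on top of a policy network and be trained by backpropagation, which is the property the summary attributes to a-projection.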

Experimental Results

Research Questions

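RQ1: How can CMDPs be solved in continuous action spaces while guaranteeing safety at every policy update?
RQ2: Can Lyapunov-based constraints be integrated into standard PG methods (PPO, DDPG) to achieve safe, data-efficient learning?
RQ3: Do theta-projection and a-projection offer practical, scalable solutions compared with existing safe RL baselines such as CPO and Lagrangian methods?
RQ4: How well do the safety guarantees of the proposed methods transfer from simulation to real robot tasks?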

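Key Findings

The Lyapunov-based PG algorithms achieve competitive performance while maintaining constraint satisfaction during training.
Compared with Lagrangian methods and CPO, the proposed approach is data efficient and can use both on-policy and off-policy data.
The a-projection safety layer often yields faster convergence and less conservative updates than theta-projection, improving learning speed and stability.
On MuJoCo tasks and a real Fetch robot, the methods balance performance and safety, and they generalize better to new environments and transfer better to real hardware.
The framework can be implemented end to end and integrates with PPO or DDPG, supporting exploratory, backpropagation-based training without relying on line search or expensive backtracking.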

This review was generated by AI and reviewed by a human editor.