QUICK REVIEW

[Paper Review] Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control

Sanket Kamthe, Marc Peter Deisenroth|arXiv (Cornell University)|2017. 06. 20.

Advanced Control Systems Optimization | References: 35 | Citations: 110

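One-line summary

Using a model-based RL framework, the paper achieves data-efficient learning under state and control constraints through Gaussian Processes and probabilistic MPC, and provides theoretical guarantees via Pontryagin's Maximum Principle.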

ABSTRACT

Trial-and-error based reinforcement learning (RL) has seen rapid advancements in recent times, especially with the advent of deep neural networks. However, the majority of autonomous RL algorithms require a large number of interactions with the environment. A large number of interactions may be impractical in many real-world applications, such as robotics, and many practical systems have to obey limitations in the form of state space or control constraints. To reduce the number of system interactions while simultaneously handling constraints, we propose a model-based RL framework based on probabilistic Model Predictive Control (MPC). In particular, we propose to learn a probabilistic transition model using Gaussian Processes (GPs) to incorporate model uncertainty into long-term predictions, thereby, reducing the impact of model errors. We then use MPC to find a control sequence that minimises the expected long-term cost. We provide theoretical guarantees for first-order optimality in the GP-based transition models with deterministic approximate inference for long-term planning. We demonstrate that our approach does not only achieve state-of-the-art data efficiency, but also is a principled way for RL in constrained environments.

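Research motivation and goals

Address the data inefficiency of reinforcement learning with a model-based approach that uses probabilistic dynamics.
Incorporate model uncertainty into long-term planning to reduce the impact of model errors.
Plan over a short horizon with model predictive control to keep computation manageable and to make constraint handling possible.
Provide theoretical guarantees of first-order optimality for GP-based dynamics.
Demonstrate handling of state and control constraints while retaining data efficiency.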

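Proposed method

Learn a probabilistic transition model with Gaussian Processes that captures both the dynamics and their uncertainty.
Propagate the GP uncertainty through time with moment matching to obtain deterministic long-term predictions.
Reformulate the probabilistic MPC problem as a deterministic optimal control problem and apply Pontryagin's Maximum Principle (PMP) for constrained planning.
Use open-loop optimization inside MPC with GP dynamics, deriving efficient gradients from the Hamiltonian for SQP/BFGS-based optimization.
Handle state and control constraints through the PMP minimum conditions, without relying on a policy parameterization.
Update the GP model online after each trial, refining the plan without re-planning over the full horizon.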
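The Hamiltonian-based gradient and short-horizon re-planning described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy scalar dynamics `f` stand in for the moment-matched GP prediction, projected gradient descent replaces the SQP/BFGS solver, only a box constraint on the control is handled, and every name and constant (`DT`, `R`, `TARGET`, `plan`, `mpc`) is an assumption made for illustration.

```python
import math

# Hypothetical constants: step size, control penalty, goal state.
DT, R, TARGET = 0.1, 0.01, 1.0

def f(x, u):
    """Toy deterministic dynamics (assumed stand-in for the GP prediction)."""
    return x + DT * (-math.sin(x) + u)

def rollout(x0, us):
    """Simulate the open-loop control sequence; returns x_0 .. x_T."""
    xs = [x0]
    for u in us:
        xs.append(f(xs[-1], u))
    return xs

def cost(x0, us):
    """J = sum of stage costs l(x_t, u_t) plus terminal cost Phi(x_T)."""
    xs = rollout(x0, us)
    stage = sum(0.5 * (x - TARGET) ** 2 + 0.5 * R * u ** 2
                for x, u in zip(xs, us))
    return stage + 0.5 * (xs[-1] - TARGET) ** 2

def adjoint_gradient(x0, us):
    """Gradient dJ/du_t = dH_t/du_t with H_t = l(x_t, u_t) + lam_{t+1} * f(x_t, u_t)."""
    xs = rollout(x0, us)
    lam = xs[-1] - TARGET                     # lam_T = dPhi/dx_T
    grad = [0.0] * len(us)
    for t in reversed(range(len(us))):
        # dl/du_t + lam_{t+1} * df/du_t
        grad[t] = R * us[t] + lam * DT
        # Costate recursion: lam_t = dl/dx_t + lam_{t+1} * df/dx_t
        lam = (xs[t] - TARGET) + lam * (1.0 - DT * math.cos(xs[t]))
    return grad

def plan(x0, horizon=10, u_max=2.0, iters=200, step=0.5):
    """Projected gradient descent on the controls, clipped to [-u_max, u_max]."""
    us = [0.0] * horizon
    for _ in range(iters):
        g = adjoint_gradient(x0, us)
        us = [min(u_max, max(-u_max, u - step * gi)) for u, gi in zip(us, g)]
    return us

def mpc(x0, steps=30):
    """Receding horizon: apply only the first planned control, then re-plan."""
    x, traj = x0, [x0]
    for _ in range(steps):
        u = plan(x)[0]
        x = f(x, u)
        traj.append(x)
    return traj
```

Calling `mpc(0.0)` drives the toy state toward `TARGET` while every planned control stays inside the box constraint. Comparing `adjoint_gradient` with a finite-difference estimate of `cost` is a cheap way to validate the backward pass before swapping in a learned model.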

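Experimental results

Research questions

RQ1: Does probabilistic MPC with GP dynamics learn faster, in terms of data efficiency, than PILCO on benchmark control tasks?
RQ2: Can data efficiency and optimality be achieved while the constraints are respected?
RQ3: How does incorporating GP uncertainty into planning affect safety and constraint satisfaction during learning?

Key results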

Experiment	PILCO	GP-MPC-Mean	GP-MPC-Var
Cart-pole	16/100	21/100	3/100
Double Pendulum	23/100	26/100	11/100

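GP-MPC is more data-efficient than PILCO on the cart-pole and double-pendulum swing-up tasks.
GP-MPC reaches high success rates in fewer trials: about 90% success on cart-pole after roughly 6 trials, and the double pendulum is likewise solved after roughly 6 trials, whereas PILCO needs more trials.
In the constrained setting, GP-MPC with uncertainty (GP-MPC-Var) solves the problem consistently, while GP-MPC-Mean, which plans with the mean only, fails in some cases and PILCO often violates the constraints.
GP-MPC with chance constraints substantially reduces expected violations compared with mean-based planning, which underlines the importance of modeling uncertainty for safety.
The method achieves state-of-the-art data efficiency while providing theoretical guarantees through PMP and moment-matched GP dynamics.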
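The difference between mean-only planning and planning with chance constraints can be made concrete with a small sketch. It assumes a scalar state whose prediction is Gaussian with mean `mu` and standard deviation `sigma > 0`, and an upper bound `x_max`; the risk level `delta` and all function names are hypothetical and do not come from the paper.

```python
from statistics import NormalDist

def tightened_bound(x_max, sigma, delta=0.05):
    """Largest mean mu with P(x <= x_max) >= 1 - delta for x ~ N(mu, sigma^2)."""
    z = NormalDist().inv_cdf(1.0 - delta)   # standard normal quantile
    return x_max - z * sigma

def satisfies_chance_constraint(mu, sigma, x_max, delta=0.05):
    """True if the Gaussian prediction respects the bound with prob. >= 1 - delta."""
    return mu <= tightened_bound(x_max, sigma, delta)

def violation_probability(mu, sigma, x_max):
    """P(x > x_max) for x ~ N(mu, sigma^2); requires sigma > 0."""
    return 1.0 - NormalDist(mu, sigma).cdf(x_max)

# A mean-only planner would accept mu = 0.95 under x_max = 1.0, yet with
# sigma = 0.1 that prediction violates the bound roughly 31% of the time.
mean_only_ok = 0.95 <= 1.0
with_variance_ok = satisfies_chance_constraint(0.95, 0.1, 1.0)
risk = violation_probability(0.95, 0.1, 1.0)
```

In this sketch the predicted variance only shrinks the feasible region for the mean, so a planner that ignores it can treat a risky trajectory as feasible.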

This review was generated by AI and reviewed by a human editor.