QUICK REVIEW

[Paper Review] A Theory of Regularized Markov Decision Processes

Matthieu Geist, Bruno Scherrer|arXiv (Cornell University)|2019. 01. 31.

Adversarial Robustness in Machine Learning | Citations: 85

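One-Line Summary

The paper develops a general theory of regularized MDPs built on regularized Bellman operators and the Legendre-Fenchel transform, and it unifies and analyzes a range of regularized DP/MDP algorithms within a single framework based on Mirror Descent and Bregman divergences.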

ABSTRACT

Many recent successful (deep) reinforcement learning algorithms make use of regularization, generally based on entropy or Kullback-Leibler divergence. We propose a general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: we consider a larger class of regularizers, and we consider the general modified policy iteration approach, encompassing both policy iteration and value iteration. The core building blocks of this theory are a notion of regularized Bellman operator and the Legendre-Fenchel transform, a classical tool of convex optimization. This approach allows for error propagation analyses of general algorithmic schemes of which (possibly variants of) classical algorithms such as Trust Region Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy Programming are special cases. This also draws connections to proximal convex optimization, especially to Mirror Descent.

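Motivation and Objectives

Formally introduce the regularized Bellman evaluation operator and its properties
Develop the Legendre-Fenchel-based regularized optimality operator and the associated greedy policy
Analyze error propagation in regularized (approximate) dynamic programming schemes
Relate regularized MDPs to convex optimization and Mirror Descent
Show that existing algorithms are special cases of the unified framework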

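Proposed Method

Define a regularized Bellman operator whose regularization term is strongly convex in the policy
Use the Legendre-Fenchel transform to obtain a regularized max operator and a soft greedy policy
Embed regularized ADP in a regularized modified policy iteration scheme and analyze its convergence
Present practical Monte Carlo or TD-style implementations for the regularized q-function
Relate practical algorithms such as SAC, TRPO, DPP, and MPO to the framework and recover them as special cases
Introduce a Bregman divergence to connect to a Mirror Descent interpretation, and present two MD-MPI schemes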
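
The Legendre-Fenchel step above can be illustrated with a minimal sketch, assuming the regularizer is the scaled negative entropy Omega(pi) = tau * sum(pi * log(pi)). Under that assumption the regularized max is a log-sum-exp and the soft greedy policy is a softmax; replacing the entropy with a KL divergence to a previous policy gives a Mirror-Descent-style multiplicative update. The function names, the temperature tau, and the q-values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def entropy_conjugate(q, tau):
    """Conjugate of Omega(pi) = tau * sum(pi * log(pi)) at q:
    max_pi <pi, q> - Omega(pi) = tau * log(sum(exp(q / tau)))."""
    q = np.asarray(q, dtype=float)
    m = q.max()  # shift by the max for numerical stability
    return m + tau * np.log(np.sum(np.exp((q - m) / tau)))

def soft_greedy(q, tau):
    """Gradient of the conjugate: the softmax policy attaining the maximum."""
    q = np.asarray(q, dtype=float)
    z = np.exp((q - q.max()) / tau)
    return z / z.sum()

def kl_greedy(q, prior, tau):
    """Maximizer of <pi, q> - tau * KL(pi || prior):
    pi is proportional to prior * exp(q / tau)."""
    q = np.asarray(q, dtype=float)
    w = np.asarray(prior, dtype=float) * np.exp((q - q.max()) / tau)
    return w / w.sum()

def regularized_objective(pi, q, tau):
    """<pi, q> - tau * sum(pi * log(pi)), treating 0 * log(0) as 0."""
    pi = np.asarray(pi, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = pi > 0
    return float(pi @ q - tau * np.sum(pi[nz] * np.log(pi[nz])))

# Hypothetical q-values for a single state with three actions.
q = np.array([1.0, 2.0, 0.5])
tau = 0.5
pi = soft_greedy(q, tau)
uniform = np.ones(3) / 3
```

Evaluating `regularized_objective(pi, q, tau)` at the softmax policy should match `entropy_conjugate(q, tau)`, while any other policy (for example `uniform`) gives a value no larger. With a uniform prior, `kl_greedy` coincides with `soft_greedy`, which is one way to see the entropy case as a special case of the KL case.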

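Experimental Results

Research Questions

RQ1: How does general regularization affect the fixed points and optimal policies of an MDP?
RQ2: Can a unified operator framework guarantee contraction and error propagation bounds for regularized DP schemes?
RQ3: How do known algorithms fit into the regularized MPI / Mirror Descent view?
RQ4: What theoretical guarantees relate regularized value functions and policies to their unregularized counterparts?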

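Key Findings

The regularized Bellman operators retain contraction and monotonicity properties analogous to those of the classical operators
The regularized optimal value function is the fixed point of the regularized optimality operator and yields a unique optimal regularized policy
Bounds relate the regularized and unregularized value functions, so the deviation introduced by regularization is controlled
The error propagation bound for regularized MPI (reg-MPI) extends the existing AMPI (approximate modified policy iteration) results to the regularized setting
The framework recovers and explains several state-of-the-art algorithms as special cases within a single theory
Introducing a Bregman divergence leads to a Mirror Descent interpretation and provides connections to existing algorithms such as TRPO and MPO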
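
The contraction and bound findings can be checked numerically with a small sketch, assuming again the scaled negative entropy regularizer with temperature tau. Since that regularizer takes values in [-tau * ln|A|, 0], the regularized optimal value should lie between the classical optimal value v* and v* + tau * ln|A| / (1 - gamma). The random MDP, the function name `value_iteration`, and all parameter values below are hypothetical and serve only as an illustration, not as the paper's experimental setup.

```python
import numpy as np

def value_iteration(P, r, gamma, tau=0.0, iters=2000):
    """Fixed-point iteration on the (regularized) optimality operator.

    P: transition tensor of shape (S, A, S); r: rewards of shape (S, A).
    tau == 0 uses the classical max backup; tau > 0 uses the
    entropy-regularized backup tau * log(sum(exp(q / tau))).
    """
    v = np.zeros(r.shape[0])
    for _ in range(iters):
        q = r + gamma * (P @ v)  # shape (S, A)
        if tau == 0.0:
            v = q.max(axis=1)
        else:
            m = q.max(axis=1)  # shift by the max for numerical stability
            v = m + tau * np.log(np.exp((q - m[:, None]) / tau).sum(axis=1))
    return v

# Hypothetical random MDP with 5 states and 3 actions.
rng = np.random.default_rng(0)
S, A, gamma, tau = 5, 3, 0.9, 0.1
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)  # normalize rows into distributions
r = rng.random((S, A))

v_hard = value_iteration(P, r, gamma)           # classical v*
v_soft = value_iteration(P, r, gamma, tau=tau)  # regularized v*
gap = tau * np.log(A) / (1.0 - gamma)           # largest allowed deviation
```

Both iterations converge because each backup is a gamma-contraction in the sup norm. Comparing `v_soft - v_hard` against `gap` shows how the temperature controls the deviation: shrinking tau tightens the interval around the unregularized solution.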

This review was generated by AI and reviewed by a human editor.