QUICK REVIEW

[Paper Review] Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs

Mohammad Sadegh Talebi, Odalric-Ambrym Maillard|arXiv (Cornell University)|2018. 03. 05.

Reinforcement Learning in Robotics · Citations: 19

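One-Line Summary

This paper presents a variance-aware regret bound for undiscounted reinforcement learning in MDPs, obtained through a new analysis of the KL-UCRL algorithm. It improves on the existing diameter-based bound by introducing a term that depends on variance. The main result is a high-probability regret bound of $\widetilde{\mathcal{O}}\left(\sqrt{S\sum_{s,a} \mathbf{V}^\star_{s,a}T}\right)$, which improves on earlier bounds by exploiting the local variance of the bias function rather than the MDP's diameter or the number of actions.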

ABSTRACT

The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem by making appear the local variance of the bias function in place of the diameter of the MDP. Furthermore, we provide a novel analysis of the KL-UCRL algorithm establishing a high-probability regret bound scaling as $\widetilde{\mathcal{O}}\Bigl({\textstyle \sqrt{S\sum_{s,a}\mathbf{V}^\star_{s,a}T}}\Bigr)$ for this algorithm for ergodic MDPs, where $S$ denotes the number of states and where $\mathbf{V}^\star_{s,a}$ is the variance of the bias function with respect to the next-state distribution following action $a$ in state $s$. The resulting bound improves upon the best previously known regret bound $\widetilde{\mathcal{O}}(DS\sqrt{AT})$ for that algorithm, where $A$ and $D$ respectively denote the maximum number of actions (per state) and the diameter of MDP. We finally compare the leading terms of the two bounds in some benchmark MDPs indicating that the derived bound can provide an order of magnitude improvement in some cases. Our analysis leverages novel variations of the transportation lemma combined with Kullback-Leibler concentration inequalities, that we believe to be of independent interest.

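Research Motivation and Goals

Improve regret bounds for undiscounted reinforcement learning by replacing the diameter of the MDP with the local variance of the bias function.
Provide a sharper high-probability regret bound for the KL-UCRL algorithm in ergodic MDPs through an analysis that accounts for the local variance of the bias function.
Establish a new regret bound that scales with the sum of state-action variances rather than with the number of actions or the diameter.
Introduce novel variations of the transportation lemma, combined with Kullback-Leibler (KL) concentration inequalities, to strengthen the analysis in MDPs.
Show that, on certain benchmark MDPs, the new bound yields a substantial improvement over the existing diameter-based bound.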

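Proposed Method

Revisits the minimax lower bound for undiscounted RL, making the local variance of the bias function appear in place of the MDP's diameter.
Proposes a novel analysis of the KL-UCRL algorithm that uses variance-aware concentration inequalities and variations of the transportation lemma.
Derives a high-probability regret bound scaling as $\widetilde{\mathcal{O}}\left(\sqrt{S\sum_{s,a} \mathbf{V}^\star_{s,a}T}\right)$, where $\mathbf{V}^\star_{s,a}$ is the variance of the bias function with respect to the next-state distribution following action $a$ in state $s$.
Uses the Bellman optimality equation and a decomposition of the bias function to relate regret to sub-optimality gaps and visit counts.
Applies Azuma-Hoeffding and Kullback-Leibler concentration inequalities to control the deviations of value estimates.
Derives a regret decomposition that separates the bias terms from the sub-optimality gaps, which enables variance-based control.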
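The quantities named in this list can be made concrete on a toy problem. The sketch below is an illustration only, not the paper's KL-UCRL algorithm: it assumes a known, hypothetical 2-state MDP, uses plain relative value iteration (a swapped-in textbook method) to approximate the optimal gain and bias function, and then evaluates the local variance $\mathbf{V}^\star_{s,a}$ and the sub-optimality gap $\varphi(s,a)$ for every state-action pair. The function names, the MDP itself, and the horizon `T` are all assumptions introduced here.

```python
import math

def expected(p, f):
    """Expectation of f[y] when y is drawn from the distribution p."""
    return sum(prob * f[y] for y, prob in enumerate(p))

def relative_value_iteration(P, R, iters=100_000, tol=1e-12):
    """Approximate the optimal gain g* and bias b* (normalised so b*[0] = 0).

    P[s][a] is a next-state distribution, R[s][a] a mean reward.
    Assumes an ergodic, aperiodic MDP so that the iteration converges.
    """
    n = len(P)
    b = [0.0] * n
    g = 0.0
    for _ in range(iters):
        # Bellman optimality backup for every state.
        backup = [max(R[s][a] + expected(P[s][a], b) for a in range(len(P[s])))
                  for s in range(n)]
        g = backup[0]                      # gain estimate at reference state 0
        new_b = [v - g for v in backup]    # re-centre so that b[0] stays 0
        converged = max(abs(x - y) for x, y in zip(new_b, b)) < tol
        b = new_b
        if converged:
            break
    return g, b

def local_variances_and_gaps(P, R, g, b):
    """Return var[s][a] = Var_{y ~ P[s][a]}[b(y)] and gap[s][a] = phi(s, a)."""
    b_sq = [x * x for x in b]
    var, gap = [], []
    for s in range(len(P)):
        var_row, gap_row = [], []
        for a in range(len(P[s])):
            mean = expected(P[s][a], b)
            var_row.append(expected(P[s][a], b_sq) - mean ** 2)
            # Gap in the Bellman optimality equation: zero for optimal actions.
            gap_row.append(g + b[s] - R[s][a] - mean)
        var.append(var_row)
        gap.append(gap_row)
    return var, gap

# Hypothetical 2-state, 2-action MDP: reward 1 is earned only in state 1.
P = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.1, 0.9], [0.5, 0.5]]]
R = [[0.0, 0.0],
     [1.0, 1.0]]

g, b = relative_value_iteration(P, R)
var, gap = local_variances_and_gaps(P, R, g, b)

# Leading term sqrt(S * sum_{s,a} V*_{s,a} * T), ignoring constants and logs.
S, T = len(P), 10_000
variance_term = math.sqrt(S * sum(map(sum, var)) * T)
print(g, b, var, gap, variance_term)
```

For this toy MDP the optimal policy moves toward state 1 and stays there, which gives a gain of 5/6 and a bias of (0, 5/3) by hand. Each state has exactly one action with zero gap, and the variances are small because the bias function takes only two nearby values.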

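Experimental Results

Research Questions

RQ1: Can the regret bound for undiscounted RL be improved by replacing the diameter of the MDP with the local variance of the bias function?
RQ2: Does the KL-UCRL algorithm achieve a sharper regret bound under a variance-aware analysis?
RQ3: How does the new variance-based regret bound compare with the existing $\widetilde{\mathcal{O}}(DS\sqrt{AT})$ bound in terms of scaling and empirical performance?
RQ4: Can the new variations of the transportation lemma and the KL concentration inequalities yield sharper bounds in low-variance MDPs?
RQ5: For which MDP structures does the variance-aware bound give a clear improvement over the diameter-based bound?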

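Key Results

The proposed regret bound is $\widetilde{\mathcal{O}}\left(\sqrt{S\sum_{s,a} \mathbf{V}^\star_{s,a}T}\right)$, which replaces the diameter $D$ and the number of actions $A$ with state-action variances.
It improves on the previously known $\widetilde{\mathcal{O}}(DS\sqrt{AT})$ bound for KL-UCRL by replacing the explicit dependence on $D$ and $A$ with the variance term.
On benchmark MDPs, low variance terms can yield a substantial improvement over the existing bound; the abstract reports an order of magnitude in some cases.
The analysis introduces novel variations of the transportation lemma and KL concentration techniques that the authors believe to be of independent interest.
The effective regret is bounded by $D + \sum_{s,a} \mathbb{E}[N_T(s,a)] \varphi(s,a)$, where $\varphi(s,a)$ is the sub-optimality gap and $N_T(s,a)$ counts the visits to $(s,a)$ up to time $T$; the bias term is controlled by the variance-aware bound.
The bound holds with high probability and assumes an ergodic MDP with finite state and action spaces.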
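A short worked inequality helps explain why the variance-based leading term cannot be much larger than the diameter-based one. The chain below is a sketch under two assumptions that are not stated in this review: rewards lie in $[0,1]$, and the span of the optimal bias function $b^\star$ is at most $D$ (a standard fact for communicating MDPs). The symbols $b^\star$ and $p(\cdot \mid s,a)$ are notation introduced here for the bias function and the next-state distribution. The first step uses the fact that a random variable confined to an interval of width $w$ has variance at most $w^2/4$, and the second uses that there are at most $SA$ state-action pairs.

```latex
\begin{align*}
\mathbf{V}^\star_{s,a}
  &= \operatorname{Var}_{y \sim p(\cdot \mid s,a)}\bigl[b^\star(y)\bigr]
   \le \tfrac{1}{4}\,\operatorname{span}(b^\star)^2
   \le \tfrac{1}{4}\,D^2, \\
\sqrt{S \sum_{s,a} \mathbf{V}^\star_{s,a}\, T}
  &\le \sqrt{S \cdot SA \cdot \tfrac{1}{4}\,D^2 \cdot T}
   = \tfrac{1}{2}\, D S \sqrt{AT}.
\end{align*}
```

Under these assumptions the new leading term is at most half of $DS\sqrt{AT}$ in the worst case, and it becomes much smaller whenever the bias function varies little across the likely next states.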


This review was generated by AI and checked by a human editor.