QUICK REVIEW

[Paper Review] Learning Near Optimal Policies with Low Inherent Bellman Error

Andrea Zanette, Alessandro Lazaric|arXiv (Cornell University)|2020. 02. 29.

Advanced Bandit Algorithms Research | References: 47 | Citations: 38

One-Line Summary

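This paper presents Eleanor, an optimistic LSVI-based algorithm for episodic reinforcement learning with linear value function approximation under low inherent Bellman error, proves a near-optimal regret bound together with a matching lower bound, and shows that the algorithm reduces to a misspecification-aware LinUCB when $H=1$.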

ABSTRACT

We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to show convergence of approximate value iteration. First we relate this condition to other common frameworks and show that it is strictly more general than the low rank (or linear) MDP assumption of prior work. Second we provide an algorithm with a high probability regret bound $\widetilde O(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t}\,\mathcal{I} K)$ where $H$ is the horizon, $K$ is the number of episodes, $\mathcal{I}$ is the value of the inherent Bellman error and $d_t$ is the feature dimension at timestep $t$. In addition, we show that the result is unimprovable beyond constants and logs by showing a matching lower bound. This has two important consequences: 1) it shows that exploration is possible using only \emph{batch assumptions} with an algorithm that achieves the optimal statistical rate for the setting we consider, which is more general than prior work on low-rank MDPs; 2) the lack of closedness (measured by the inherent Bellman error) is only amplified by $\sqrt{d_t}$ despite working in the online setting. Finally, the algorithm reduces to the celebrated \textsc{LinUCB} when $H=1$ but with a different choice of the exploration parameter that allows handling misspecified contextual linear bandits. While computational tractability questions remain open for the MDP setting, this enriches the class of MDPs with a linear representation for the action-value function where statistically efficient reinforcement learning is possible.

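Research Motivation and Objectives

Enable exploration with approximate linear action-value functions under low inherent Bellman error (IBE).
Clarify how IBE relates to the low-rank MDP and LSPI conditions, and show that it applies more broadly.
Develop Eleanor, an optimistic, globally optimized LSVI-style algorithm that preserves the linearity of the Q-function.
Establish information-theoretically tight regret guarantees and show their implications for misspecified contextual linear bandits.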

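Proposed Method

Define the inherent Bellman error (IBE) for the class of linear Q-functions and connect it to the linear and low-rank MDP frameworks.
Extend LSVI to the optimistic setting by solving a Planning Optimization Program that jointly selects the parameters $\theta_t$ and the optimistic perturbations across the whole horizon.
To preserve linearity, introduce globally optimized perturbations $\bar{\xi}_t$ subject to ellipsoidal constraints in parameter space, which enables accurate confidence bounds.
Derive a regret bound of $\widetilde O(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t}\,\mathcal{I} K)$, where $\mathcal{I}$ is the inherent Bellman error (IBE) and $\widetilde O$ hides polylogarithmic factors.
Show that when $H=1$ the algorithm reduces to LinUCB with a modified exploration parameter that handles misspecification in contextual linear bandits.
Discuss computational considerations and the connection to misspecified contextual linear bandits.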
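The $H=1$ reduction above can be illustrated with a minimal LinUCB sketch in Python. This is not the authors' implementation: the function names, the simulated environment, and in particular the form of `exploration_radius` (a standard self-normalized term plus an additive `eps * sqrt(n)` term for a misspecification level `eps`) are assumptions chosen only to show how an inflated exploration parameter enters the arm-selection rule.

```python
import numpy as np


def linucb_choose(features, A, b, beta):
    """Return the index of the arm with the largest optimistic estimate.

    features: (n_arms, d) array of arm feature vectors.
    A, b: regularized Gram matrix and reward-weighted feature sum.
    beta: exploration parameter scaling the elliptical bonus.
    """
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b  # ridge-regression estimate
    # Elliptical bonus ||x||_{A^{-1}} for every arm.
    bonus = np.sqrt(np.einsum("id,de,ie->i", features, A_inv, features))
    return int(np.argmax(features @ theta_hat + beta * bonus))


def exploration_radius(n_rounds, d, eps, lam=1.0, delta=0.05):
    """Hypothetical radius (an assumption, not the paper's exact constant)."""
    base = np.sqrt(d * np.log((1.0 + n_rounds / lam) / delta)) + np.sqrt(lam)
    # Extra term that grows with the misspecification level eps.
    return base + eps * np.sqrt(n_rounds)


def run(n_rounds=500, d=4, n_arms=10, eps=0.05, seed=0):
    """Simulate a misspecified linear bandit and return cumulative regret."""
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=d)
    theta_star /= np.linalg.norm(theta_star)
    A, b, regret = np.eye(d), np.zeros(d), 0.0
    for k in range(1, n_rounds + 1):
        X = rng.normal(size=(n_arms, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        # True mean reward: linear part plus a deviation bounded by eps.
        mean = X @ theta_star + eps * rng.uniform(-1.0, 1.0, size=n_arms)
        arm = linucb_choose(X, A, b, exploration_radius(k, d, eps))
        reward = mean[arm] + 0.1 * rng.normal()
        A += np.outer(X[arm], X[arm])
        b += reward * X[arm]
        regret += mean.max() - mean[arm]
    return regret
```

Calling `run(eps=0.0)` gives the well-specified case, while a positive `eps` widens the confidence set so that the bounded deviation from linearity cannot push the best arm outside it.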

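Results

Research Questions

RQ1: Can exploration be carried out effectively in online episodic RL with a linear Q-function class under low inherent Bellman error (IBE)?
RQ2: How does IBE relate to, and extend, the low-rank MDP and LSPI conditions?
RQ3: What regret guarantees hold for an optimistic LSVI-style algorithm that preserves linearity and handles misspecification?
RQ4: Does the proposed approach recover known results (e.g., LinUCB) in the special case $H=1$, and how does misspecification affect the bound?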

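Key Findings

Eleanor achieves a regret bound of $\widetilde O(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t}\,\mathcal{I} K)$, where $\mathcal{I}$ is the inherent Bellman error and $\widetilde O$ hides polylogarithmic factors.
The inherent Bellman error framework is strictly more general than the low-rank MDP assumption, and misspecification is handled with the IBE amplified only by a factor of $\sqrt{d_t}$.
The result cannot be improved beyond constants and logarithmic factors, as shown by a matching lower bound for the setting without misspecification.
When $H=1$, Eleanor reduces to LinUCB with a modified exploration parameter that accommodates misspecification in contextual linear bandits.
The analysis also extends to low-rank MDPs, where it improves on previous bounds by a factor of the square root of the feature dimension, and it provides a systematic way to manage misspecification in the online setting.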
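To see how the two terms of the regret bound trade off, the following worked comparison may help. It assumes a constant feature dimension $d_t = d$ for every timestep and ignores constants and logarithmic factors; the numbers in the example are illustrative only and do not come from the paper.

```latex
% Assumption for illustration: d_t = d for every t.
\[
\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t}\,\mathcal{I} K
  \;=\; H d \sqrt{K} + H \sqrt{d}\,\mathcal{I} K .
\]
% The statistical term is at least as large as the misspecification term when
\[
d \sqrt{K} \;\ge\; \sqrt{d}\,\mathcal{I} K
\iff \sqrt{K} \;\le\; \frac{\sqrt{d}}{\mathcal{I}}
\iff K \;\le\; \frac{d}{\mathcal{I}^{2}} .
\]
% Example: d = 100 and \mathcal{I} = 0.01 give K \le 100 / 10^{-4} = 10^{6}.
```

Under these assumptions the $\sqrt{K}$ term governs the bound for roughly the first $d/\mathcal{I}^2$ episodes, after which the linear-in-$K$ misspecification term takes over.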

This review was generated by AI and reviewed by a human editor.