QUICK REVIEW

[Paper Review] Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

Kefan Dong, Yuanhao Wang|arXiv (Cornell University)|2019. 01. 27.

Reinforcement Learning in Robotics|References 16|Citations 37

One-Line Summary

The paper introduces a Q-learning algorithm with UCB exploration for infinite-horizon discounted MDPs without a generative model and proves a PAC-MDP-style bound of $\tilde{O}({\frac{SA}{ε^2(1-γ)^7}})$ on the sample complexity of exploration.

ABSTRACT

A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. \cite{jin2018q} proposed a Q-learning algorithm with UCB exploration policy, and proved it has nearly optimal regret bound for finite-horizon episodic MDP. In this paper, we adapt Q-learning with UCB-exploration bonus to infinite-horizon MDP with discounted rewards \emph{without} accessing a generative model. We show that the \textit{sample complexity of exploration} of our algorithm is bounded by $\tilde{O}({\frac{SA}{ε^2(1-γ)^7}})$. This improves the previously best known result of $\tilde{O}({\frac{SA}{ε^4(1-γ)^8}})$ in this setting achieved by delayed Q-learning \cite{strehl2006pac}, and matches the lower bound in terms of $ε$ as well as $S$ and $A$ except for logarithmic factors.

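Motivation and Objectives

Motivate the study of sample efficiency for model-free RL without a simulator in infinite-horizon discounted MDPs.
Propose a Q-learning algorithm augmented with a UCB exploration bonus.
Establish a PAC-style sample complexity bound for exploration in this setting.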

Proposed Method

Infinite Q-learning with UCB (Algorithm 1) maintains optimistic Q estimates Q(s,a) and a lower-credible bound for each (s,a).
Incorporate an exploration bonus b_k = c_2/(1-γ) * sqrt(H * iota(k) / k) into Q-value updates.
Use a slowly changing learning rate alpha_k = (H+1)/(H+k) and track counts N(s,a) to guide exploration.
Define a sufficient condition for ε-optimality at time t and connect it to a trajectory-based error bound (Condition 1 and Condition 2).
Prove a PAC-MDP bound on the number of ε-suboptimal steps across the infinite horizon, leveraging a key lemma bounding weighted learning errors (Lemma 2).
Show that the sample complexity of exploration for Algorithm 1 is $\tilde{O}({\frac{SA}{ε^2(1-γ)^7}})$.
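The update loop described in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1 verbatim: the bonus b_k and the learning rate alpha_k follow the formulas in this review, while the logarithmic form of iota(k), the use of a running minimum as the lower bound, greedy action selection on that bound, and treating H and c_2 as caller-supplied parameters are all assumptions. The names `ucb_q_learning` and `toy_step` are hypothetical.

```python
import math
import random


def toy_step(s, a, rng):
    """Hypothetical 2-state environment: reward 1 only for action 1 in state 1."""
    r = 1.0 if (s == 1 and a == 1) else 0.0
    # Action a moves to state a with probability 0.9, otherwise to the other state.
    s_next = a if rng.random() < 0.9 else 1 - a
    return r, s_next


def ucb_q_learning(step, n_states, n_actions, s0, gamma, H, c2, delta, n_steps, seed=0):
    """Run UCB-bonus Q-learning along a single trajectory (no resets, no simulator)."""
    rng = random.Random(seed)
    v_max = 1.0 / (1.0 - gamma)  # largest possible discounted return for rewards in [0, 1]
    Q = [[v_max] * n_actions for _ in range(n_states)]      # optimistic estimate
    Q_hat = [[v_max] * n_actions for _ in range(n_states)]  # running minimum of Q (assumption)
    N = [[0] * n_actions for _ in range(n_states)]          # visit counts N(s, a)
    s = s0
    for _ in range(n_steps):
        # Act greedily with respect to the lower bound (assumption).
        a = max(range(n_actions), key=lambda x: Q_hat[s][x])
        r, s_next = step(s, a, rng)
        N[s][a] += 1
        k = N[s][a]
        # Assumed logarithmic form of iota(k); the review does not spell it out.
        iota = math.log(n_states * n_actions * (k + 1) * (k + 2) / delta)
        b_k = c2 / (1.0 - gamma) * math.sqrt(H * iota / k)  # exploration bonus
        alpha = (H + 1.0) / (H + k)                         # slowly decaying learning rate
        v_next = max(Q_hat[s_next])
        Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * (r + b_k + gamma * v_next)
        Q_hat[s][a] = min(Q_hat[s][a], Q[s][a])
        s = s_next
    return Q_hat, N
```

Only the two tables Q and Q_hat plus the counts N are stored, which is where the O(SA) memory footprint comes from. A call such as `ucb_q_learning(toy_step, 2, 2, 0, gamma=0.9, H=10.0, c2=1.0, delta=0.1, n_steps=500)` runs the sketch on the toy environment; the parameter values are arbitrary.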

Experimental Results

Research Questions

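RQ1: What is the sample complexity of exploration for model-free Q-learning with UCB exploration in infinite-horizon discounted MDPs without a generative model?
RQ2: In the infinite-horizon setting, can UCB-style exploration improve on earlier model-free algorithms (e.g., Delayed Q-learning)?
RQ3: How should ε-optimality be defined and bounded over an infinite trajectory, and what condition is sufficient to guarantee ε-optimality at a given time step?
RQ4: From a PAC-MDP perspective, how are the finite-horizon analysis techniques adapted to infinite-horizon MDPs?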
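For RQ1 and RQ3, it may help to write out the quantity being bounded. The following is the standard PAC-MDP notion of sample complexity of exploration, which the abstract's terminology appears to refer to; the notation ($\pi_t$ for the policy the algorithm follows at step $t$, $s_t$ for the state visited at step $t$) is an assumption made here for illustration.

```latex
% Number of time steps at which the current policy is more than epsilon-suboptimal
% at the state actually visited along the single infinite trajectory.
\[
  \bigl|\{\, t \ge 1 \;:\; V^{\pi_t}(s_t) < V^{*}(s_t) - \epsilon \,\}\bigr|
\]
% A PAC-MDP guarantee states that, with probability at least 1 - \delta,
% this count is bounded by a polynomial in S, A, 1/\epsilon, 1/(1-\gamma), and \log(1/\delta).
```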

Key Results

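With high probability, the proposed Q-learning with UCB algorithm achieves a sample complexity of exploration of $\tilde{O}({\frac{SA}{ε^2(1-γ)^7}})$.
This bound improves on the previous best known result in this setting, $\tilde{O}({\frac{SA}{ε^4(1-γ)^8}})$, achieved by Delayed Q-learning.
The bound matches the lower bound in its dependence on ε, S, and A, up to logarithmic factors.
The analysis highlights essential differences between infinite-horizon and finite-horizon MDPs: error propagation along the entire trajectory and an error structure over non-consecutive time steps.
The algorithm stores only O(SA) values, giving it a memory advantage over some model-based alternatives.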
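To get a feel for the size of the improvement over Delayed Q-learning, the two bounds can be divided directly. This is a back-of-the-envelope comparison that ignores constants and logarithmic factors; the values ε = 0.1 and γ = 0.99 are chosen only for illustration.

```latex
\[
  \frac{SA \,/\, \bigl(\epsilon^{4}(1-\gamma)^{8}\bigr)}
       {SA \,/\, \bigl(\epsilon^{2}(1-\gamma)^{7}\bigr)}
  \;=\; \frac{1}{\epsilon^{2}(1-\gamma)}
\]
% Example: \epsilon = 0.1, \gamma = 0.99 gives 1 / (0.01 \cdot 0.01) = 10^{4}.
```

The gap grows as ε shrinks or as γ approaches 1, and does not depend on S or A.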


This review was generated by AI and reviewed by a human editor.