QUICK REVIEW

[Paper Review] Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

Kefan Dong, Yuanhao Wang|arXiv (Cornell University)|Jan 27, 2019

Reinforcement Learning in Robotics|16 references|37 citations

TL;DR

The paper introduces a Q-learning algorithm with UCB exploration for infinite-horizon discounted MDPs without a generative model and proves a PAC-MDP-style bound of Õ(SA / (ε^2 (1−γ)^7)) on its sample complexity of exploration.

ABSTRACT

A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. \cite{jin2018q} proposed a Q-learning algorithm with UCB exploration policy, and proved it has nearly optimal regret bound for finite-horizon episodic MDP. In this paper, we adapt Q-learning with UCB-exploration bonus to infinite-horizon MDP with discounted rewards \emph{without} accessing a generative model. We show that the \textit{sample complexity of exploration} of our algorithm is bounded by $\tilde{O}({\frac{SA}{ε^2(1-γ)^7}})$. This improves the previously best known result of $\tilde{O}({\frac{SA}{ε^4(1-γ)^8}})$ in this setting achieved by delayed Q-learning \cite{strehl2006pac}, and matches the lower bound in terms of $ε$ as well as $S$ and $A$ except for logarithmic factors.

Motivation & Objective

Motivate the study of sample efficiency for model-free RL without simulators in infinite-horizon discounted MDPs.
Propose a Q-learning algorithm augmented with UCB exploration bonuses.
Establish a PAC-like sample complexity bound for the exploration process in this setting.

Proposed method

Propose Infinite Q-learning with UCB (Algorithm 1), which maintains optimistic Q estimates Q(s,a) and a lower-credible bound for each (s,a).
Incorporate an exploration bonus b_k = c2/(1-γ) * sqrt(H * iota(k) / k) into Q-value updates, where k is the visit count of the current (s,a) pair.
Use a slowly changing learning rate alpha_k = (H+1)/(H+k) and track counts N(s,a) to guide exploration.
Define a sufficient condition for ε-optimality at time t and connect it to a trajectory-based error bound (Condition 1 and Condition 2).
Prove a PAC-MDP bound on the number of ε-suboptimal steps across the infinite horizon, leveraging a key lemma bounding weighted learning errors (Lemma 2).
Show that the sample complexity of exploration of Algorithm 1 is Õ(SA / (ε^2 (1-γ)^7)).
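
The update described in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's Algorithm 1: it uses only the learning rate alpha_k and bonus b_k quoted in this review, and everything else is an assumption made for the example, including the function name `ucb_q_step`, the optimistic initial value 1/(1-γ), the stand-in log factor `iota`, and the values of `H` and `c2`.

```python
import math
from collections import defaultdict


def ucb_q_step(Q, N, s, a, r, s_next, actions, gamma, H, c2, iota):
    """Apply one optimistic Q-learning update to the pair (s, a), in place.

    Returns the learning rate and bonus that were used, for inspection.
    """
    N[(s, a)] += 1
    k = N[(s, a)]                      # visit count of (s, a), including this visit
    alpha = (H + 1) / (H + k)          # learning rate alpha_k from the review
    bonus = c2 / (1 - gamma) * math.sqrt(H * iota(k) / k)  # exploration bonus b_k
    v_next = max(Q[(s_next, b)] for b in actions)          # greedy value of next state
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + bonus + gamma * v_next)
    return alpha, bonus


# Hypothetical settings, chosen only for illustration.
gamma, H, c2 = 0.5, 3, 1.0
iota = lambda k: math.log(2 * (k + 1))      # assumed stand-in for the log factor iota(k)
actions = [0, 1]
Q = defaultdict(lambda: 1.0 / (1 - gamma))  # assumed optimistic initial value
N = defaultdict(int)

# Two visits to the same (state, action) pair.
first = ucb_q_step(Q, N, s=0, a=0, r=1.0, s_next=1, actions=actions,
                   gamma=gamma, H=H, c2=c2, iota=iota)
second = ucb_q_step(Q, N, s=0, a=0, r=1.0, s_next=1, actions=actions,
                    gamma=gamma, H=H, c2=c2, iota=iota)
```

On the first visit k = 1, so alpha_1 = 1 and the initial estimate is overwritten entirely by the bonus-augmented target. On later visits both the learning rate and the bonus shrink as the count grows, which is how the counts N(s,a) steer exploration toward rarely visited pairs.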

Experimental results

Research questions

RQ1: What is the sample complexity of exploration for model-free Q-learning with UCB exploration in infinite-horizon discounted MDPs without a generative model?
RQ2: Can UCB-style exploration improve over prior model-free algorithms (e.g., Delayed Q-learning) in the infinite-horizon setting?
RQ3: How can ε-optimality be defined and bounded across an infinite trajectory, and what sufficient conditions ensure ε-optimality at a given time step?
RQ4: How do the analysis techniques carry over from finite-horizon to infinite-horizon MDPs in PAC-MDP terms?

Key findings

The proposed Q-learning algorithm with UCB exploration achieves, with high probability, a sample complexity of exploration of Õ(SA / (ε^2 (1-γ)^7)).
This bound improves the previously best-known result of Õ(SA / (ε^4 (1-γ)^8)), achieved by Delayed Q-learning in the infinite-horizon setting.
The bound matches the known lower bound in its dependence on ε, S, and A, up to logarithmic factors.
The analysis highlights essential differences between infinite-horizon and finite-horizon MDPs, including the trajectory-wide error propagation and non-consecutive time-step error structure.
The algorithm stores only O(SA) values, offering memory efficiency advantages over some model-based alternatives.
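
To get a feel for the improvement over Delayed Q-learning noted above, the short sketch below compares the leading terms of the two bounds. It ignores constants and logarithmic factors, so the numbers indicate scaling only; the helper `leading_term` and the values of S, A, ε, and γ are hypothetical choices for illustration.

```python
def leading_term(S, A, eps, gamma, eps_power, horizon_power):
    """Leading term SA / (eps^p * (1 - gamma)^q), without constants or log factors."""
    return S * A / (eps ** eps_power * (1.0 - gamma) ** horizon_power)


# Hypothetical problem size and accuracy target.
S, A, eps, gamma = 10, 4, 0.1, 0.9

ucb_q = leading_term(S, A, eps, gamma, eps_power=2, horizon_power=7)
delayed_q = leading_term(S, A, eps, gamma, eps_power=4, horizon_power=8)

# The SA factors cancel, leaving 1 / (eps^2 * (1 - gamma)).
ratio = delayed_q / ucb_q
print(f"Delayed Q-learning term is about {ratio:.0f}x larger")
```

Because the ratio is 1 / (ε^2 (1-γ)), the gap widens as the accuracy target tightens or as γ approaches 1; with ε = 0.1 and γ = 0.9 it is roughly a factor of 1000.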

This review was created by AI and reviewed by human editors.