[Paper Review] Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP
The paper introduces a Q-learning algorithm with UCB exploration for infinite-horizon discounted MDPs without a generative model and proves a PAC-MDP-style bound of Õ(SA / (ε^2 (1−γ)^7)) on the sample complexity of exploration.
A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. \cite{jin2018q} proposed a Q-learning algorithm with UCB exploration policy, and proved it has nearly optimal regret bound for finite-horizon episodic MDP. In this paper, we adapt Q-learning with UCB-exploration bonus to infinite-horizon MDP with discounted rewards \emph{without} accessing a generative model. We show that the \textit{sample complexity of exploration} of our algorithm is bounded by $\tilde{O}({\frac{SA}{ε^2(1-γ)^7}})$. This improves the previously best known result of $\tilde{O}({\frac{SA}{ε^4(1-γ)^8}})$ in this setting achieved by delayed Q-learning \cite{strehl2006pac}, and matches the lower bound in terms of $ε$ as well as $S$ and $A$ except for logarithmic factors.
Motivation & Objective
- Motivate the study of sample efficiency for model-free RL without simulators in infinite-horizon discounted MDPs.
- Propose a Q-learning algorithm augmented with UCB exploration bonuses.
- Establish a PAC-like sample complexity bound for the exploration process in this setting.
Proposed method
- Propose Infinite Q-learning with UCB (Algorithm 1), which maintains optimistic Q estimates Q(s,a) and a lower-credible bound for each (s,a).
- Incorporate an exploration bonus b_k = c_2/(1-γ) * sqrt(H * ι(k) / k) into Q-value updates.
- Use a slowly changing learning rate alpha_k = (H+1)/(H+k) and track counts N(s,a) to guide exploration.
- Define a sufficient condition for ε-optimality at time t and connect it to a trajectory-based error bound (Condition 1 and Condition 2).
- Prove a PAC-MDP bound on the number of ε-suboptimal steps across the infinite horizon, leveraging a key lemma bounding weighted learning errors (Lemma 2).
- Show that the sample complexity of exploration of Algorithm 1 is Õ(SA / (ε^2 (1-γ)^7)).
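The update described in the bullets above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1 verbatim: rewards are assumed to lie in [0, 1], the second table `Q_hat` is modeled as a running minimum of `Q`, the log term `iota(k)` is assumed to be ln(SA(k+1)(k+2)/δ), and `H`, `delta`, and `c2` are passed in as free parameters because the review does not give their values. The `toy_step` environment is hypothetical and only there to make the sketch runnable.

```python
import math


def ucb_q_learning(step, n_states, n_actions, gamma, H, delta, c2, n_steps, s0=0):
    """Tabular Q-learning with a UCB bonus along one trajectory (no resets)."""
    v_max = 1.0 / (1.0 - gamma)  # optimistic start; assumes rewards in [0, 1]
    Q = [[v_max] * n_actions for _ in range(n_states)]
    Q_hat = [[v_max] * n_actions for _ in range(n_states)]  # assumed: running min of Q
    N = [[0] * n_actions for _ in range(n_states)]
    s = s0
    for _ in range(n_steps):
        # Act greedily with respect to the second table.
        a = max(range(n_actions), key=lambda x: Q_hat[s][x])
        r, s_next = step(s, a)
        N[s][a] += 1
        k = N[s][a]
        # Assumed form of the log factor iota(k).
        iota = math.log(n_states * n_actions * (k + 1) * (k + 2) / delta)
        bonus = c2 / (1.0 - gamma) * math.sqrt(H * iota / k)  # b_k
        alpha = (H + 1.0) / (H + k)                           # alpha_k
        v_next = max(Q_hat[s_next])
        Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * (r + bonus + gamma * v_next)
        Q_hat[s][a] = min(Q_hat[s][a], Q[s][a])
        s = s_next
    return Q_hat, N


def toy_step(s, a):
    """Hypothetical 2-state MDP: the action picks the next state; reward 1 only for (1, 1)."""
    reward = 1.0 if (s == 1 and a == 1) else 0.0
    return reward, a
```

A call such as `ucb_q_learning(toy_step, 2, 2, gamma=0.9, H=20, delta=0.1, c2=1.0, n_steps=500)` returns the second table and the visit counts. Only the two tables and the counts are stored, which is where the O(SA) memory footprint comes from.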
Experimental results
Research questions
- RQ1: What is the sample complexity of exploration for model-free Q-learning with UCB exploration in infinite-horizon discounted MDPs without a generative model?
- RQ2: Can UCB-style exploration improve over prior model-free algorithms (e.g., Delayed Q-learning) in the infinite-horizon setting?
- RQ3: How can ε-optimality be defined and bounded along an infinite trajectory, and what sufficient conditions ensure ε-optimality at a given time step?
- RQ4: How do the analysis techniques for finite-horizon MDPs adapt to infinite-horizon MDPs in PAC-MDP terms?
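The quantity behind RQ1 and RQ3 can be written out explicitly. The block below states the usual PAC-MDP notion of sample complexity of exploration, which is assumed here to be the one the paper uses. The symbols N_ε, π_t (the policy the algorithm follows from time t onward), V^* (the optimal value function), and δ (the failure probability) are notation introduced for illustration.

```latex
\[
  N_\varepsilon \;=\; \bigl|\{\, t \ge 1 \;:\; V^{\pi_t}(s_t) < V^{*}(s_t) - \varepsilon \,\}\bigr|,
  \qquad
  \Pr\!\left[\, N_\varepsilon \le \tilde{O}\!\left(\frac{SA}{\varepsilon^{2}(1-\gamma)^{7}}\right) \right] \;\ge\; 1-\delta .
\]
```

In words, N_ε counts the time steps at which the current policy is more than ε worse than optimal from the current state, and the bound holds with high probability.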
Key findings
- The proposed UCB-Q learning algorithm achieves a sample complexity of exploration of Õ(SA / (ε^2 (1-γ)^7)) with high probability.
- This bound improves the previously best-known result of Õ(SA / (ε^4 (1-γ)^8)) from Delayed Q-learning in the infinite-horizon setting.
- The bound matches the lower bound in its dependence on ε, S, and A, up to logarithmic factors.
- The analysis highlights essential differences between infinite-horizon and finite-horizon MDPs, including the trajectory-wide error propagation and non-consecutive time-step error structure.
- The algorithm stores only O(SA) values, offering memory efficiency advantages over some model-based alternatives.
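To make the size of the improvement concrete, the two bounds quoted above can be compared numerically. The sketch below drops the logarithmic factors and constants hidden by the Õ notation, so the numbers are orders of magnitude only; the function names are hypothetical.

```python
def ucb_q_bound(S, A, eps, gamma):
    """SA / (eps^2 (1 - gamma)^7), ignoring log factors and constants."""
    return S * A / (eps ** 2 * (1.0 - gamma) ** 7)


def delayed_q_bound(S, A, eps, gamma):
    """SA / (eps^4 (1 - gamma)^8), ignoring log factors and constants."""
    return S * A / (eps ** 4 * (1.0 - gamma) ** 8)


def improvement_factor(eps, gamma):
    """Ratio of the two bounds above: 1 / (eps^2 (1 - gamma))."""
    return 1.0 / (eps ** 2 * (1.0 - gamma))
```

For ε = 0.1 and γ = 0.9 the ratio is about 1000, independent of S and A. The gap widens as ε shrinks or as γ approaches 1.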
This review was created by AI and reviewed by human editors.