QUICK REVIEW

[Paper Review] Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition

Zihan Zhang, Yuan Zhou|arXiv (Cornell University)|Apr 21, 2020

Reinforcement Learning in Robotics|26 references|45 citations

TL;DR

The paper introduces UCB-Advantage, a model-free RL algorithm with a reference-advantage decomposition that achieves near-optimal regret for finite-horizon episodic MDPs and enjoys low switching costs, matching model-based methods up to logarithmic factors.

ABSTRACT

We study the reinforcement learning problem in the setting of finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and episode length $H$. We propose a model-free algorithm UCB-Advantage and prove that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret where $T = KH$ and $K$ is the number of episodes to play. Our regret bound improves upon the results of [Jin et al., 2018] and matches the best known model-based algorithms as well as the information theoretic lower bound up to logarithmic factors. We also show that UCB-Advantage achieves low local switching cost and applies to concurrent reinforcement learning, improving upon the recent results of [Bai et al., 2019].

Motivation & Objective

Ask whether model-free RL can attain learning efficiency comparable to model-based methods while keeping its low space and time complexity.
Propose a novel model-free algorithm, UCB-Advantage, that uses reference-advantage decomposition to improve regret and data efficiency.
Show that UCB-Advantage attains regret matching optimal model-based bounds up to logarithmic factors and exhibits low local switching costs.
Extend the approach to concurrent RL settings, highlighting practical benefits for batched or parallel learning.

Proposed method

Introduce a stage-based update framework where each state-action-step triple (s,a,h) collects data in stages with exponentially growing lengths.
Propose a reference-advantage decomposition V* = Vref + (V* − Vref) and update Q using two terms: (i) a reference-based term estimated with all samples, and (ii) an advantage-based term estimated with samples from the current stage only.
Provide an advantage-based update rule: Q_h(s,a) ← P_s,a,h V_ref_{h+1} + P_s,a,h (V_{h+1} − V_ref_{h+1}) + r_h(s,a) + b (with b as an exploration bonus).
Adopt a standard update rule in parallel, enabling integration of the two rules within the stage-based framework.
Learn a fixed reference value function Vref with bounded sample complexity and progressively refine it during learning.
Present theoretical guarantees: (i) regret bound Regret(T) ≤ ~O(√(H^2 S A T)) with high probability, (ii) improved local switching cost O(S A H^2 log T) compared to prior work, and (iii) a corollary for concurrent RL with near-optimal episode complexity.
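The stage schedule and the two-term update described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the specific schedule (first stage of length H, each later stage longer by a factor of roughly 1 + 1/H), the clipping at H, and the scalar `bonus` argument are all assumptions made for the example.

```python
def stage_lengths(H, num_stages):
    """Assumed schedule: e_1 = H, e_{i+1} = floor((1 + 1/H) * e_i).

    Integer arithmetic is used because floor((1 + 1/H) * e) == e + e // H
    for integer e and H.
    """
    lengths = [H]
    for _ in range(num_stages - 1):
        e = lengths[-1]
        lengths.append(e + e // H)
    return lengths


def reference_advantage_q(ref_next_all, v_next_stage, ref_next_stage,
                          reward, bonus, H):
    """One optimistic Q estimate for a fixed (s, a, h).

    ref_next_all   : V_ref_{h+1}(s') over ALL samples seen so far
    v_next_stage   : V_{h+1}(s') over samples from the current stage only
    ref_next_stage : V_ref_{h+1}(s') over the same current-stage samples
    bonus          : exploration bonus b (a placeholder scalar here)
    """
    # Reference term: low-variance average over every available sample.
    ref_term = sum(ref_next_all) / len(ref_next_all)
    # Advantage term: average of V - V_ref over the latest stage only.
    adv_term = sum(v - r for v, r in zip(v_next_stage, ref_next_stage))
    adv_term /= len(v_next_stage)
    # Clipping at H is an assumption; values in an H-step episode cannot exceed H.
    return min(reward + ref_term + adv_term + bonus, H)


print(stage_lengths(4, 5))
print(reference_advantage_q([1.0, 3.0], [2.5], [2.0], 0.5, 0.25, 10))
```

The point of the split is visible in the argument lists: the reference term can use every sample because V_ref does not change, while the advantage term is restricted to the current stage, where V − V_ref is expected to be small.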

Experimental results

Research questions

RQ1: Can model-free reinforcement learning achieve regret bounds comparable to model-based approaches in finite-horizon episodic MDPs?
RQ2: Does a reference-advantage decomposition reduce variance and improve data efficiency in model-free Q-learning?
RQ3: How does a stage-based update framework influence switching costs and practicality for concurrent RL?
RQ4: What are the theoretical limits (lower bounds) for model-free methods in this setting, and how close can they get to model-based guarantees?

Key findings

UCB-Advantage achieves a regret bound of ~O(√(H^2 S A T)) with high probability, matching the information-theoretic lower bound up to logarithmic factors.
The algorithm improves on the regret of prior model-free methods [Jin et al., 2018] by a factor of √H and matches model-based algorithms such as UCBVI and vUCQ up to log factors.
The stage-based update framework yields a low local switching cost of O(S A H^2 log T), improving upon prior results.
The approach extends to concurrent RL, finding an ε-optimal policy in ~O(H^2 S A + H^3 S A / (ε^2 M)) concurrent episodes, where M is the number of parallel agents, with an accompanying lower bound showing near-optimality.
The reference-advantage decomposition enables using all samples for the reference term while restricting the more variable second term to the latest stage, reducing variance and enabling tighter regret analysis.
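To make the bounds above concrete, the following sketch evaluates them numerically with constants and logarithmic factors dropped. The function names and the example values of H, S, A, K, ε and M are illustrative assumptions, and the √(H^3 S A T) expression used for the prior model-free bound is the Q-learning rate from [Jin et al., 2018].

```python
import math


def regret_ucb_advantage(H, S, A, K):
    """sqrt(H^2 * S * A * T) with T = K * H; log factors and constants dropped."""
    T = K * H
    return math.sqrt(H**2 * S * A * T)


def regret_prior_model_free(H, S, A, K):
    """sqrt(H^3 * S * A * T), the earlier Q-learning rate, same simplifications."""
    T = K * H
    return math.sqrt(H**3 * S * A * T)


def concurrent_episodes(H, S, A, eps, M):
    """H^2 S A + H^3 S A / (eps^2 * M); log factors and constants dropped."""
    return H**2 * S * A + H**3 * S * A / (eps**2 * M)


H, S, A, K = 16, 10, 5, 1000  # hypothetical problem size
ratio = regret_prior_model_free(H, S, A, K) / regret_ucb_advantage(H, S, A, K)
print(ratio)  # the gap between the two rates is sqrt(H)
print(concurrent_episodes(H, S, A, eps=0.5, M=4))
```

In the concurrent bound only the second term shrinks with M, so adding agents helps until the H^2 S A term dominates.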

This review was created by AI and reviewed by human editors.