QUICK REVIEW

[Paper Review] An Improved Convergence Analysis of Stochastic Variance-Reduced Policy Gradient

Pan Xu, Felicia Gao|arXiv (Cornell University)|May 29, 2019

Reinforcement Learning in Robotics|25 references|33 citations

TL;DR

The paper provides a tighter convergence analysis of SVRPG, showing it achieves an epsilon-approximate stationary point with O(1/epsilon^{5/3}) trajectories, improving over O(1/epsilon^2).
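
As a quick check of the claimed improvement, the ratio of the two bounds can be worked out directly. This is a sketch that ignores the constants hidden by the O(·) notation, and the value epsilon = 10^{-3} is only an illustrative choice, not a setting from the paper.

```latex
\[
\frac{1/\epsilon^{2}}{1/\epsilon^{5/3}} = \epsilon^{5/3 - 2} = \epsilon^{-1/3},
\qquad
\text{e.g. } \epsilon = 10^{-3}: \quad
\frac{10^{6}}{10^{5}} = 10 .
\]
```

Under that illustrative choice, the improved bound asks for roughly ten times fewer trajectories, up to constant factors.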

ABSTRACT

We revisit the stochastic variance-reduced policy gradient (SVRPG) method proposed by Papini et al. (2018) for reinforcement learning. We provide an improved convergence analysis of SVRPG and show that it can find an $\epsilon$-approximate stationary point of the performance function within $O(1/\epsilon^{5/3})$ trajectories. This sample complexity improves upon the best known result $O(1/\epsilon^2)$ by a factor of $O(1/\epsilon^{1/3})$. At the core of our analysis is (i) a tighter upper bound for the variance of importance sampling weights, where we prove that the variance can be controlled by the parameter distance between different policies; and (ii) a fine-grained analysis of the epoch length and batch size parameters such that we can significantly reduce the number of trajectories required in each iteration of SVRPG. We also empirically demonstrate the effectiveness of our theoretical claims of batch sizes on reinforcement learning benchmark tasks.

Motivation & Objective

Revisit and analyze the stochastic variance-reduced policy gradient (SVRPG) method for reinforcement learning.
Provide a tighter convergence bound for SVRPG than prior work.
Show how the variance of the importance sampling weights can be controlled by the parameter distance between policies, and how the choice of epoch length and batch sizes affects sample complexity.
Demonstrate empirical effectiveness on standard RL benchmarks (Cartpole, Mountain Car).

Proposed method

Revisit the SVRPG framework that combines SVRG with policy gradient estimators (REINFORCE/GPOMDP).
Derive a tighter variance bound for importance sampling weights in non-stationary trajectory distributions.
Perform a refined analysis of epoch length and batch sizes to reduce trajectories per iteration.
Prove that SVRPG achieves E[||∇J(θ_out)||^2] ≤ ε with O(1/ε^{5/3}) trajectories.
Provide corollaries to relate step size, batch sizes, and epoch length to total sample complexity.
Empirically validate the batch-size choices on the RL benchmarks Cartpole and Mountain Car.
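
The SVRG-style estimator listed above can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes a one-step Gaussian policy N(theta, SIGMA^2) with a hypothetical quadratic reward, uses a plain REINFORCE estimate without a baseline, and returns the last iterate for simplicity. All names (`svrpg`, `grad_estimate`, `importance_weight`, `SIGMA`, `TARGET`) and the hyperparameter values are assumptions made for this example.

```python
import numpy as np

SIGMA = 1.0    # fixed policy standard deviation (assumed for this toy problem)
TARGET = 2.0   # action with the highest reward (hypothetical)

def reward(a):
    """Toy one-step reward, maximized at a = TARGET."""
    return -(a - TARGET) ** 2

def grad_estimate(a, theta):
    """REINFORCE estimate r(a) * d/dtheta log N(a; theta, SIGMA^2)."""
    return reward(a) * (a - theta) / SIGMA**2

def importance_weight(a, theta_now, theta_snap):
    """p(a | theta_snap) / p(a | theta_now) for equal-variance Gaussians."""
    return np.exp(((a - theta_now) ** 2 - (a - theta_snap) ** 2) / (2 * SIGMA**2))

def svrpg(theta0, epochs, epoch_len, N, B, step, seed=0):
    """SVRG-style policy gradient ascent on the toy problem."""
    rng = np.random.default_rng(seed)
    theta_snap = theta0
    for _ in range(epochs):
        # Snapshot gradient: large batch of N samples from the snapshot policy.
        a = rng.normal(theta_snap, SIGMA, size=N)
        mu = grad_estimate(a, theta_snap).mean()
        theta = theta_snap
        for _ in range(epoch_len):
            # Mini-batch of B samples from the current policy.
            a = rng.normal(theta, SIGMA, size=B)
            w = importance_weight(a, theta, theta_snap)
            # Semi-stochastic gradient: snapshot gradient plus a correction
            # whose snapshot term is reweighted to the current policy.
            v = mu + np.mean(grad_estimate(a, theta) - w * grad_estimate(a, theta_snap))
            theta = theta + step * v  # gradient ascent on expected reward
        theta_snap = theta
    return theta_snap

if __name__ == "__main__":
    print(svrpg(theta0=0.0, epochs=20, epoch_len=5, N=200, B=20, step=0.05))
```

In this sketch, each epoch costs N + epoch_len * B samples, which is the quantity the refined choice of epoch length and batch sizes aims to keep small.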

Experimental results

Research questions

RQ1: Can SVRPG be provably faster than vanilla stochastic policy gradient methods in terms of sample complexity?
RQ2: What are the tight variance bounds for importance weights in SVRPG under non-stationary sampling?
RQ3: How should epoch length and batch sizes be chosen to minimize trajectory requirements while preserving convergence?
RQ4: Do the theoretical improvements translate into practical gains on standard RL tasks?

Key findings

SVRPG can find an ε-approximate stationary point with O(1/ε^{5/3}) trajectories.
This improves over the best known O(1/ε^{2}) trajectory complexity by a factor of O(1/ε^{1/3}).
A tighter upper bound shows the variance of importance sampling weights can be controlled by the parameter distance between policies.
A refined choice of epoch length and batch sizes reduces the number of trajectories needed per iteration without degrading the convergence rate.
Empirical experiments on Cartpole and Mountain Car corroborate the theoretical advantages of the proposed batch-size choices.
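
The link between importance-weight variance and parameter distance can be illustrated numerically in a simple special case. The sketch below is not the paper's bound: it assumes two one-step Gaussian policies with the same standard deviation whose means differ by `delta`, for which the weight variance has the closed form exp(delta^2 / sigma^2) - 1, roughly delta^2 / sigma^2 when delta is small. The function names and sample size are assumptions made for this example.

```python
import numpy as np

def weight_variance_mc(delta, sigma=1.0, n=200_000, seed=0):
    """Monte Carlo variance of p(a | theta + delta) / p(a | theta), a ~ p(. | theta)."""
    rng = np.random.default_rng(seed)
    theta, theta_other = 0.0, delta
    a = rng.normal(theta, sigma, size=n)
    # Log of the density ratio; the Gaussian normalizing constants cancel.
    log_w = ((a - theta) ** 2 - (a - theta_other) ** 2) / (2 * sigma**2)
    return np.exp(log_w).var()

def weight_variance_exact(delta, sigma=1.0):
    """Closed form for equal-variance Gaussians."""
    return np.exp(delta**2 / sigma**2) - 1.0

if __name__ == "__main__":
    for delta in (0.1, 0.3, 0.5):
        print(delta, weight_variance_mc(delta), weight_variance_exact(delta))
```

In this special case the variance shrinks to zero as the two policies approach each other, which matches the qualitative behavior the tighter bound describes.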

This review was created by AI and reviewed by human editors.