QUICK REVIEW

[Paper Review] Sample Efficient Policy Gradient Methods with Recursive Variance Reduction

Pan Xu, Felicia Gao | arXiv (Cornell University) | Sep 18, 2019
Reinforcement Learning in Robotics | 65 references | 34 citations
TL;DR

The paper introduces SRVR-PG, a stochastic recursive variance reduced policy gradient method achieving O(1/ε^{3/2}) sample complexity to reach an ε-approximate stationary point, with a SRVR-PG-PE variant for parameter-space exploration, validated on classic control tasks.

ABSTRACT

Improving the sample efficiency in reinforcement learning has been a long-standing research problem. In this work, we aim to reduce the sample complexity of existing policy gradient methods. We propose a novel policy gradient algorithm called SRVR-PG, which only requires $O(1/\epsilon^{3/2})$ episodes to find an $\epsilon$-approximate stationary point of the nonconcave performance function $J(\boldsymbol{\theta})$ (i.e., $\boldsymbol{\theta}$ such that $\|\nabla J(\boldsymbol{\theta})\|_2^2 \leq \epsilon$). This sample complexity improves the existing result $O(1/\epsilon^{5/3})$ for stochastic variance reduced policy gradient algorithms by a factor of $O(1/\epsilon^{1/6})$. In addition, we also propose a variant of SRVR-PG with parameter exploration, which explores the initial policy parameter from a prior probability distribution. We conduct numerical experiments on classic control problems in reinforcement learning to validate the performance of our proposed algorithms.

Motivation & Objective

  • Motivate reducing sample complexity in policy gradient methods for nonconvex performance functions.
  • Propose SRVR-PG to achieve improved sample efficiency via recursive variance reduction.
  • Develop a variant SRVR-PG-PE that adds parameter-based exploration.
  • Provide theoretical guarantees on convergence and sample complexity.
  • Demonstrate empirical performance on classical reinforcement learning control tasks.

Proposed method

  • Introduce a stochastic recursive variance reduced policy gradient (SRVR-PG) algorithm with S epochs and an outer snapshot gradient.
  • Use a recursive semi-stochastic gradient estimator v_t that combines a gradient term from trajectories sampled under the current policy θ_t with a step-wise importance-weighted term g_ω evaluated at the previous iterate θ_{t-1}: v_t = v_{t-1} + (1/B) Σ_j [g(τ_j|θ_t) − g_ω(τ_j|θ_{t-1})].
  • Employ importance weighting to correct the distribution mismatch that arises because trajectories are sampled from the current policy θ_t while the gradient is estimated at θ_{t-1}, so that E[g_ω(τ|θ_{t-1})] matches E[g(τ|θ_{t-1})] (the first expectation taken over trajectories from θ_t, the second over trajectories from θ_{t-1}).
  • Update θ via projected gradient ascent θ_{t+1} = P_Θ(θ_t + η v_t) where P_Θ is projection onto a convex constraint set Θ.
  • Provide a convergence analysis under the assumptions of a bounded policy gradient and Hessian, bounded gradient variance, and bounded variance of the importance weights.
  • Show that with appropriate choices of η, m, N, B, SRVR-PG achieves E[||G_η(θ_out)||^2] ≤ ε in O(1/ε^{3/2}) trajectories.
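
The update scheme in the bullets above can be sketched on a toy problem. This is a minimal illustration, not the paper's implementation: it assumes a one-step (horizon 1) task with a Gaussian policy a ~ N(θ, σ²), a hypothetical reward −(a − a*)², a constraint set Θ = [−5, 5], and hand-picked values for η, m, N, and B. The names `grad_est`, `weighted_grad_est`, and `srvr_pg_toy` are invented for this sketch, and it returns the last iterate for simplicity.

```python
import numpy as np

A_STAR = 1.0  # hypothetical target action of the toy reward (assumption)
SIGMA = 1.0   # fixed policy standard deviation (assumption)

def reward(a):
    return -(a - A_STAR) ** 2

def grad_est(a, theta):
    """REINFORCE-style estimate g(tau|theta) for a one-step Gaussian policy."""
    return reward(a) * (a - theta) / SIGMA**2

def weighted_grad_est(a, theta_prev, theta_cur):
    """g_omega(tau|theta_prev): actions drawn under theta_cur, reweighted to theta_prev."""
    # Importance weight p(a|theta_prev) / p(a|theta_cur) for equal-variance Gaussians.
    log_w = ((a - theta_cur) ** 2 - (a - theta_prev) ** 2) / (2 * SIGMA**2)
    return np.exp(log_w) * grad_est(a, theta_prev)

def srvr_pg_toy(theta0=-1.0, epochs=30, m=10, N=500, B=50, eta=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Projection P_Theta onto the convex set Theta = [-5, 5].
    project = lambda x: float(np.clip(x, -5.0, 5.0))
    theta = theta0
    for _ in range(epochs):
        # Epoch start: large-batch reference gradient v_0 from N samples.
        a = rng.normal(theta, SIGMA, size=N)
        v = grad_est(a, theta).mean()
        theta_prev, theta = theta, project(theta + eta * v)
        for _ in range(m - 1):
            # v_t = v_{t-1} + (1/B) sum_j [g(tau_j|theta_t) - g_omega(tau_j|theta_{t-1})]
            a = rng.normal(theta, SIGMA, size=B)
            v += (grad_est(a, theta) - weighted_grad_est(a, theta_prev, theta)).mean()
            theta_prev, theta = theta, project(theta + eta * v)
    return theta

theta_final = srvr_pg_toy()
print(theta_final)  # should end up near A_STAR for this toy reward
```

In this toy setting the correction term has small variance whenever consecutive iterates are close, which is the intuition behind using a small batch B inside each epoch and a large batch N only at the epoch start.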

Experimental results

Research questions

  • RQ1: Can SRVR-PG reduce the sample complexity of policy gradient methods for nonconvex performance functions compared to prior variance-reduced methods?
  • RQ2: How does incorporating step-wise importance weighting and recursion affect convergence guarantees and sample complexity?
  • RQ3: Does the SRVR-PG-PE variant with parameter-space exploration improve performance without increasing trajectory complexity?
  • RQ4: What are the theoretical guarantees for Gaussian policies regarding horizon and discount factor dependencies?

Key findings

Algorithm | Sample complexity
REINFORCE (Williams, 1992) | O(1/ε^{2})
PGT (Sutton et al., 2000) | O(1/ε^{2})
GPOMDP (Baxter & Bartlett, 2001) | O(1/ε^{2})
SVRPG (Papini et al., 2018) | O(1/ε^{2})
SVRPG (Xu et al., 2019) | O(1/ε^{5/3})
SRVR-PG (This paper) | O(1/ε^{3/2})

  • SRVR-PG achieves an ε-approximate stationary point with O(1/ε^{3/2}) trajectories, improving over O(1/ε^{5/3}) for prior SVRPG by a factor of O(1/ε^{1/6}).
  • The analysis yields an iteration complexity that avoids the O(1/B) term in some prior results and makes mini-batch size independent of horizon H.
  • For Gaussian policies, the method attains O(1/ε^{3/2}) sample complexity with an explicit dependence on the discount factor through (1−γ), and its dependence on the horizon H differs from that in some earlier work.
  • SRVR-PG-PE integrates parameter-based exploration and can perform better in practice on control tasks without increasing sample complexity.
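
To give a feel for the rates compared above, the following sketch evaluates ε^{-p} for the three exponents in the complexity comparison. It ignores the constants and problem-dependent factors hidden by the O(·) notation, so the numbers only indicate relative scaling, not actual trajectory counts. The helper `scaling` and the choice ε = 10^{-3} are assumptions made for illustration.

```python
import math

def scaling(eps, exponent):
    """Leading-order term eps^(-exponent); constants hidden by O(.) are ignored."""
    return eps ** (-exponent)

eps = 1e-3
rates = {
    "O(1/eps^2)": 2.0,
    "O(1/eps^(5/3))": 5.0 / 3.0,
    "O(1/eps^(3/2))": 3.0 / 2.0,
}
for name, p in rates.items():
    print(f"{name:>16}: {scaling(eps, p):.3e}")

# Improvement factor: eps^(-5/3) / eps^(-3/2) = eps^(-1/6), since 5/3 - 3/2 = 1/6.
ratio = scaling(eps, 5.0 / 3.0) / scaling(eps, 3.0 / 2.0)
print(f"improvement factor at eps={eps}: {ratio:.3f}")  # about 3.162
```

The factor ε^{-1/6} grows slowly as ε shrinks, so the gain over the O(1/ε^{5/3}) rate is most visible when a high-accuracy stationary point is required.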


This review was created by AI and reviewed by human editors.