QUICK REVIEW

[Paper Digest] An Improved Convergence Analysis of Stochastic Variance-Reduced Policy Gradient

Pan Xu, Felicia Gao|arXiv (Cornell University)|May 29, 2019

Reinforcement Learning in Robotics|References: 25|Cited by: 33

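One-Sentence Summary

This paper gives a tighter convergence analysis of SVRPG, showing that it reaches an ε-approximate stationary point within O(1/ε^{5/3}) trajectories, an improvement over the previous O(1/ε^2).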

ABSTRACT

We revisit the stochastic variance-reduced policy gradient (SVRPG) method proposed by Papini et al. (2018) for reinforcement learning. We provide an improved convergence analysis of SVRPG and show that it can find an $\epsilon$-approximate stationary point of the performance function within $O(1/\epsilon^{5/3})$ trajectories. This sample complexity improves upon the best known result $O(1/\epsilon^2)$ by a factor of $O(1/\epsilon^{1/3})$. At the core of our analysis is (i) a tighter upper bound for the variance of importance sampling weights, where we prove that the variance can be controlled by the parameter distance between different policies; and (ii) a fine-grained analysis of the epoch length and batch size parameters such that we can significantly reduce the number of trajectories required in each iteration of SVRPG. We also empirically demonstrate the effectiveness of our theoretical claims of batch sizes on reinforcement learning benchmark tasks.

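Motivation and Goals

Motivate and analyze the stochastic variance-reduced policy gradient (SVRPG) method in reinforcement learning.
Give a tighter convergence bound for SVRPG than prior work.
Show how the variance of the importance sampling weights can be controlled by the distance between policies, and how the choice of epoch length and batch size affects sample complexity.
Demonstrate empirical effectiveness on standard RL benchmark tasks (Cartpole, Mountain Car).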

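Proposed Method

Revisit the SVRPG framework, which combines SVRG with policy gradient estimators (REINFORCE/GPOMDP).
Derive a tighter variance bound for the importance sampling weights under a non-stationary trajectory distribution.
Carry out a fine-grained analysis of the epoch length and batch size to reduce the number of trajectories required in each iteration.
Prove that SVRPG attains E[||∇J(θ_out)||^2] ≤ ε within O(1/ε^{5/3}) trajectories.
State corollaries that relate the step size, batch size, and epoch length to the total sample complexity.
Validate the batch-size choices empirically on the RL benchmarks Cartpole and Mountain Car.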
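The SVRPG update summarized in the list above can be sketched on a toy problem. Everything below is an illustrative assumption rather than the paper's setup: a one-step task, a one-dimensional Gaussian policy with fixed standard deviation `SIGMA`, a quadratic reward centered at `TARGET`, a plain REINFORCE estimator, and hypothetical hyperparameters. The sketch also returns the last iterate for simplicity instead of an output picked at random from the iterates.

```python
import math
import random

SIGMA = 1.0    # fixed policy standard deviation (assumption)
TARGET = 2.0   # reward peaks when the action equals TARGET (assumption)


def reward(action):
    """Toy one-step reward: negative squared distance to TARGET."""
    return -(action - TARGET) ** 2


def reinforce_grad(action, theta):
    """REINFORCE estimate for a Gaussian policy a ~ N(theta, SIGMA^2)."""
    score = (action - theta) / SIGMA ** 2   # d/dtheta log pi_theta(action)
    return score * reward(action)


def importance_weight(action, theta, theta_snap):
    """omega = p(action | theta_snap) / p(action | theta)."""
    log_ratio = -((action - theta_snap) ** 2 - (action - theta) ** 2) / (2 * SIGMA ** 2)
    return math.exp(log_ratio)


def svrpg(theta0, epochs=30, epoch_len=10, snap_batch=100, mini_batch=10,
          step=0.02, seed=0):
    """Toy SVRPG loop; all default hyperparameters are hypothetical."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(epochs):
        theta_snap = theta
        # Reference gradient from a large batch sampled at the snapshot.
        mu = sum(reinforce_grad(rng.gauss(theta_snap, SIGMA), theta_snap)
                 for _ in range(snap_batch)) / snap_batch
        for _ in range(epoch_len):
            correction = 0.0
            for _ in range(mini_batch):
                a = rng.gauss(theta, SIGMA)   # sampled from the CURRENT policy
                w = importance_weight(a, theta, theta_snap)
                # The importance weight re-weights the snapshot gradient so it
                # can be evaluated on samples drawn from the current policy.
                correction += reinforce_grad(a, theta) - w * reinforce_grad(a, theta_snap)
            v = mu + correction / mini_batch  # semi-stochastic gradient
            theta += step * v                 # gradient ascent on the return
    return theta


if __name__ == "__main__":
    print(svrpg(0.0))  # should drift from 0.0 toward TARGET
```

The importance weight is needed because the inner-loop samples come from the current policy while the snapshot gradient was defined under the snapshot policy; on the first inner step the two policies coincide, the correction vanishes, and the update reduces to the large-batch reference gradient.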

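Experimental Results

Research Questions

RQ1: Can SVRPG be rigorously shown to be faster than vanilla stochastic policy gradient methods in terms of sample complexity?
RQ2: What is a tight variance bound for the importance weights of SVRPG under non-stationary sampling?
RQ3: How should the epoch length and batch size be chosen to minimize the number of trajectories required while preserving convergence?
RQ4: Do these theoretical improvements translate into practical gains on standard RL tasks?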

Main Findings

Method	Sample complexity (trajectories)
SG	O(1/ε^2)
SVRPG (Papini et al., 2018)	O(1/ε^2)
SVRPG (This paper)	O(1/ε^{5/3})
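To make the gap between the rates in the table concrete, here is a short worked calculation. It ignores the constants hidden by the O-notation, and the value ε = 10^{-3} is chosen purely for illustration.

```latex
\[
\frac{1/\epsilon^{2}}{1/\epsilon^{5/3}} = \epsilon^{-1/3},
\qquad
\epsilon = 10^{-3}:\quad
\frac{1}{\epsilon^{2}} = 10^{6},\quad
\frac{1}{\epsilon^{5/3}} = 10^{5},\quad
\epsilon^{-1/3} = 10 .
\]
```

Under these assumptions the improved rate corresponds to roughly ten times fewer trajectories at this accuracy, and the ratio grows as ε shrinks.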

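SVRPG can find an ε-approximate stationary point within O(1/ε^{5/3}) trajectories.
This improves on the best previously known trajectory complexity of O(1/ε^{2}) by a factor of O(1/ε^{1/3}).
A tighter upper bound shows that the variance of the importance sampling weights can be controlled by the parameter distance between policies.
The refined choice of epoch length and batch size reduces the number of trajectories required per iteration without slowing convergence.
Experiments on Cartpole and Mountain Car support the theoretical advantage of the proposed batch-size choices.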
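The finding that the importance-weight variance is governed by the distance between policy parameters can be checked numerically in a simple case. The sketch below is not the paper's bound: it assumes a one-step, one-dimensional Gaussian policy with shared standard deviation `sigma`, for which the variance of the weight has the closed form exp(d^2/σ^2) − 1, where d is the parameter distance. The function names are hypothetical.

```python
import math
import random


def closed_form_variance(dist, sigma=1.0):
    """Var of p(a | theta + dist) / p(a | theta) for a ~ N(theta, sigma^2)."""
    return math.exp((dist / sigma) ** 2) - 1.0


def monte_carlo_variance(dist, sigma=1.0, n=20000, seed=0):
    """Estimate the same variance by sampling actions from N(theta, sigma^2)."""
    rng = random.Random(seed)
    theta, theta_other = 0.0, dist
    weights = []
    for _ in range(n):
        a = rng.gauss(theta, sigma)
        # Log of the density ratio; the Gaussian normalizers cancel.
        log_w = -((a - theta_other) ** 2 - (a - theta) ** 2) / (2 * sigma ** 2)
        weights.append(math.exp(log_w))
    mean = sum(weights) / n
    return sum((w - mean) ** 2 for w in weights) / n


if __name__ == "__main__":
    for d in (0.05, 0.1, 0.2, 0.4):
        print(d, closed_form_variance(d), monte_carlo_variance(d))
```

For small d the closed form behaves like d^2/σ^2, so in this toy setting the variance shrinks quadratically as the two policies approach each other, which is consistent with the qualitative claim above.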

This digest was generated by AI and reviewed by human editors.