QUICK REVIEW

[Paper Digest] Finite-Sample Analysis of Nonlinear Stochastic Approximation with Applications in Reinforcement Learning

Zaiwei Chen, Sheng Zhang|arXiv (Cornell University)|May 27, 2019

Reinforcement Learning in Robotics | References: 45 | Cited by: 31

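One-Sentence Summary

This paper provides finite-sample convergence bounds for nonlinear stochastic approximation (SA) under Markovian noise, establishing exponentially fast convergence to a neighborhood of the limit point under constant stepsizes and an $O(\log k / k)$ convergence rate under diminishing stepsizes. The results are applied to $Q$-learning with linear function approximation, yielding the first finite-sample bounds under a new condition relating the behavior policy, the discount factor, and the basis functions, and are verified numerically on Baird's counterexample.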

ABSTRACT

Motivated by applications in reinforcement learning (RL), we study a nonlinear stochastic approximation (SA) algorithm under Markovian noise, and establish its finite-sample convergence bounds under various stepsizes. Specifically, we show that when using constant stepsize (i.e., $α_k\equiv α$), the algorithm achieves exponential fast convergence to a neighborhood (with radius $O(α\log(1/α))$) around the desired limit point. When using diminishing stepsizes with appropriate decay rate, the algorithm converges with rate $O(\log(k)/k)$. Our proof is based on Lyapunov drift arguments, and to handle the Markovian noise, we exploit the fast mixing of the underlying Markov chain. To demonstrate the generality of our theoretical results on Markovian SA, we use it to derive the finite-sample bounds of the popular $Q$-learning with linear function approximation algorithm, under a condition on the behavior policy. Importantly, we do not need to make the assumption that the samples are i.i.d., and do not require an artificial projection step in the algorithm to maintain the boundedness of the iterates. Numerical simulations corroborate our theoretical results.

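Motivation and Objectives

Establish finite-sample convergence guarantees for nonlinear stochastic approximation (SA) under Markovian noise, a setting that is common in reinforcement learning (RL) but understudied in finite-sample analysis.
Prove boundedness of the iterates through Lyapunov drift arguments and the geometric mixing of the Markov chain, removing the need for an artificial projection step in the SA algorithm.
Apply the SA results to $Q$-learning with linear function approximation, giving the first finite-sample convergence bounds under a sufficient condition for stability.
Verify numerically, on Baird's well-known divergence counterexample, that the derived condition is sufficient and that the predicted convergence rates hold.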

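Proposed Method

The authors use a Lyapunov drift argument to analyze the expected decrease in the distance to the solution, constructing a suitable Lyapunov function whose drift is negative in expectation.
The geometric mixing of the underlying Markov chain is used to control the dependence in the noise, which yields finite-sample bounds under Markovian sampling.
For constant stepsizes, the iterates converge to a neighborhood of radius $O(\alpha \log(1/\alpha))$ around the limit point.
For diminishing stepsizes $\alpha_k = \alpha / (k + h)^\xi$, the analysis derives an $O(\log k / k)$ convergence rate, with the optimal rate attained at $\xi = 1$.
The method is applied to $Q$-learning with linear function approximation by modeling its update as a nonlinear SA with Markovian noise.
A sufficient condition for convergence is derived that involves the behavior policy $\pi$, the discount factor $\gamma$, and the basis functions, formalized as $\omega(\pi) > \gamma^2$.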
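To make the update concrete, the following is a minimal sketch of $Q$-learning with linear function approximation run along a single Markovian trajectory, with no projection step and the stepsize family $\alpha_k = \alpha / (k + h)^\xi$. It is an illustration under stated assumptions, not the authors' code: the function names, the random toy MDP, the uniform behavior policy, and the one-hot features (which reduce the problem to the tabular case) are all hypothetical choices made for this example.

```python
import numpy as np


def stepsize(k, alpha=0.5, h=10.0, xi=1.0):
    """alpha_k = alpha / (k + h)**xi; xi = 0 gives the constant stepsize alpha."""
    return alpha / (k + h) ** xi


def q_learning_linear(P, R, phi, behavior, gamma, num_steps, rng, **step_kwargs):
    """Q-learning with linear function approximation on one Markovian trajectory.

    P[s, a] is a distribution over next states, R[s, a] a reward,
    phi[s, a] a feature vector, behavior[s] a distribution over actions.
    The iterate theta is never projected onto a bounded set.
    """
    n_states, n_actions, d = phi.shape
    theta = np.zeros(d)
    s = 0
    for k in range(num_steps):
        # Sample (a, s') from the behavior policy and the transition kernel,
        # so consecutive samples are Markovian rather than i.i.d.
        a = rng.choice(n_actions, p=behavior[s])
        s_next = rng.choice(n_states, p=P[s, a])
        # The max over next actions is what makes the update nonlinear in theta.
        td_target = R[s, a] + gamma * np.max(phi[s_next] @ theta)
        td_error = td_target - phi[s, a] @ theta
        theta = theta + stepsize(k, **step_kwargs) * td_error * phi[s, a]
        s = s_next
    return theta


# Hypothetical toy problem (not Baird's counterexample).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.5
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))
# One-hot features: an assumption that makes this the tabular special case.
phi = np.eye(n_states * n_actions).reshape(n_states, n_actions, -1)
behavior = np.full((n_states, n_actions), 1.0 / n_actions)

theta = q_learning_linear(P, R, phi, behavior, gamma, 5000, rng,
                          alpha=0.5, h=10.0, xi=1.0)
print(theta)
```

With one-hot features and stepsizes below 1, every update is a convex combination, so the iterates stay within $\max|R| / (1 - \gamma)$ without projection. For general features no such guarantee holds, which is where a condition such as $\omega(\pi) > \gamma^2$ becomes relevant.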

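Experimental Results

Research Questions

RQ1: Can finite-sample convergence bounds be established for nonlinear SA under Markovian noise without relying on i.i.d. samples or an artificial projection step?
RQ2: What convergence rate is achievable under constant stepsizes in nonlinear Markovian SA, and can exponential convergence be proved?
RQ3: For $Q$-learning with linear function approximation, which is known to diverge in general, what conditions are needed for finite-sample convergence?
RQ4: How do the theoretical convergence rates compare with observed performance, particularly on known divergence cases such as Baird's counterexample?
RQ5: Can the derived stability condition be verified numerically, and is it sufficient in practice?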

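Key Findings

For constant stepsizes, the nonlinear SA algorithm converges exponentially fast to a neighborhood of radius $O(\alpha \log(1/\alpha))$ around the solution.
For diminishing stepsizes $\alpha_k = \alpha / (k + h)^\xi$, the algorithm converges at rate $O(\log k / k)$, with the optimal rate attained at $\xi = 1$.
The proposed condition $\omega(\pi) > \gamma^2$ ensures finite-sample convergence of $Q$-learning with linear function approximation, where $\omega(\pi)$ quantifies how well the behavior policy explores variations in the basis functions.
In the numerical experiments, the algorithm converges exponentially fast when $\gamma = 0.7$ and diverges when $\gamma = 0.97$, supporting the sufficiency of the condition.
For diminishing stepsizes, the empirical convergence rate matches the theoretical $O(\log k / k)$ rate: the slope of $\log \mathbb{E}[\|\theta_k - \theta^*\|^2]$ against $\log k$ is approximately $-\xi$.
When the condition is satisfied, $Q$-learning is stable on Baird's counterexample, showing the practical relevance of the theoretical bounds.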
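The slope check described in the findings can be sketched on a much simpler problem. The example below is an assumption-laden toy, not the paper's experiment: it runs a scalar nonlinear SA recursion driven by a two-state Markov chain, estimates $\mathbb{E}[x_k^2]$ by averaging over independent runs, and fits the slope of $\log \mathbb{E}[x_k^2]$ against $\log k$. The drift $-x + 0.5\sin(x) + y$ (with root $x^* = 0$ under the stationary distribution), the chain on $\{-1, +1\}$, and all function names and parameter values are hypothetical.

```python
import numpy as np


def simulate_mse(alpha, h, xi, num_steps, num_runs, flip_prob, rng):
    """Estimate E[x_k^2] for a toy nonlinear SA with Markovian noise.

    Recursion: x_{k+1} = x_k + alpha_k * (-x_k + 0.5 * sin(x_k) + y_k),
    with alpha_k = alpha / (k + h)**xi and y_k a symmetric two-state
    Markov chain on {-1, +1} (stationary mean zero, so x* = 0).
    """
    x = np.ones(num_runs)                        # every run starts at x_0 = 1
    y = rng.choice([-1.0, 1.0], size=num_runs)   # initial chain states
    mse = np.empty(num_steps)
    for k in range(num_steps):
        mse[k] = np.mean(x ** 2)                 # average over independent runs
        step = alpha / (k + h) ** xi
        x = x + step * (-x + 0.5 * np.sin(x) + y)
        flips = rng.random(num_runs) < flip_prob
        y = np.where(flips, -y, y)               # Markovian (correlated) noise
    return mse


def loglog_slope(mse, k_min):
    """Least-squares slope of log(mse[k]) against log(k) on a log-spaced grid."""
    ks = np.unique(np.geomspace(k_min, len(mse) - 1, num=50).astype(int))
    slope, _intercept = np.polyfit(np.log(ks), np.log(mse[ks]), 1)
    return slope


rng = np.random.default_rng(0)
mse = simulate_mse(alpha=4.0, h=10.0, xi=1.0, num_steps=5000,
                   num_runs=2000, flip_prob=0.3, rng=rng)
slope = loglog_slope(mse, k_min=200)
print(f"fitted log-log slope: {slope:.2f}")
```

In this toy setting the fitted slope should land near $-\xi$ for $\xi = 1$, provided $\alpha$ is large enough relative to the local contraction of the drift. The early iterations are excluded from the fit through `k_min` because the transient from the initial condition would otherwise distort the slope.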

This digest was generated by AI and reviewed by a human editor.