QUICK REVIEW

[Paper Review] SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation

Bo Dai, Albert Shaw|arXiv (Cornell University)|Dec 29, 2017

Adaptive Dynamic Programming Control | Cited by 120

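One-Sentence Summary

SBEED reformulates the Bellman equation as a smoothed primal-dual saddle-point problem, so that reinforcement learning with nonlinear function approximators (such as neural networks) converges, and it backs this with convergence guarantees and favorable empirical results on continuous control tasks.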

ABSTRACT

When function approximation is used, solving the Bellman optimality equation with stability guarantees has remained a major open problem in reinforcement learning for decades. The fundamental difficulty is that the Bellman operator may become an expansion in general, resulting in oscillating and even divergent behavior of popular algorithms like Q-learning. In this paper, we revisit the Bellman equation, and reformulate it into a novel primal-dual optimization problem using Nesterov's smoothing technique and the Legendre-Fenchel transformation. We then develop a new algorithm, called Smoothed Bellman Error Embedding, to solve this optimization problem where any differentiable function class may be used. We provide what we believe to be the first convergence guarantee for general nonlinear function approximation, and analyze the algorithm's sample complexity. Empirically, our algorithm compares favorably to state-of-the-art baselines in several benchmark control problems.

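Motivation and Objectives

Address the instability and divergence of Bellman-based methods when nonlinear function approximators are used.
Introduce a smoothed Bellman operator that enables stable optimization.
Develop a primal-dual objective that avoids the double-sampling problem and supports off-policy learning.
Provide convergence guarantees and a sample complexity analysis for nonlinear function approximation.
Demonstrate empirical performance on benchmark control problems.
Extend to both continuous and discrete action spaces, and unify value function estimation with policy optimization.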

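Proposed Method

Reformulate the Bellman equation by replacing the max with a Nesterov-smoothed, entropy-regularized maximum, which yields a contraction operator with a unique fixed point.
Derive a primal-dual objective that combines the value function V, the policy π, and a dual variable ν (or ρ), so that optimization no longer involves the nonsmooth max operator.
Use Fenchel duality to turn the squared Bellman error into a saddle-point problem, which avoids the double-sampling problem.
Introduce a two-player (minimax) objective L_η(V, π; ρ) that trades off the squared Bellman residual against a variance-canceling dual term.
Develop a stochastic mirror descent algorithm (SBEED) that updates the nonlinear function approximators for V and π and solves the dual update.
Provide theoretical guarantees covering convergence to a stationary point, generalization bounds, and an explicit error decomposition that includes smoothing bias and approximation error.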
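
The double-sampling issue and the role of η can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it considers a single state-action pair whose one-sample Bellman residual `x` is Gaussian with an assumed mean and standard deviation, and the helper names `naive_squared_error` and `dual_objective` are hypothetical. It shows that averaging squared single-sample residuals overestimates the squared mean residual by the variance, and that subtracting η times the dual term E[(x − ρ)²], with ρ at its maximizing value (the mean of x), removes a fraction η of that variance.

```python
import numpy as np

def naive_squared_error(x):
    """Mean of squared single-sample residuals: estimates (E[x])^2 + Var[x]."""
    return np.mean(x ** 2)

def dual_objective(x, rho, eta):
    """Sample estimate of E[x^2] - eta * E[(x - rho)^2] for a fixed dual value rho."""
    return np.mean(x ** 2) - eta * np.mean((x - rho) ** 2)

rng = np.random.default_rng(0)
true_mean, true_std = 0.5, 2.0          # assumed residual statistics (illustrative)
x = rng.normal(true_mean, true_std, size=200_000)

# Maximizing the objective over rho means minimizing E[(x - rho)^2],
# whose solution is rho = E[x]. The sample mean stands in for a learned dual function.
rho_star = x.mean()

naive = naive_squared_error(x)                  # biased upward by Var[x]
full = dual_objective(x, rho_star, eta=1.0)     # variance fully canceled
half = dual_objective(x, rho_star, eta=0.5)     # half of the variance remains

print(f"target (E[x])^2      : {true_mean ** 2:.3f}")
print(f"naive estimate       : {naive:.3f}")
print(f"dual objective eta=1 : {full:.3f}")
print(f"dual objective eta=.5: {half:.3f}")
```

In this toy setting η = 1 recovers the squared mean residual, while smaller η keeps part of the variance term in the objective. How η should be chosen in practice is a tuning question this sketch does not address.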

Experimental Results

Research Questions

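RQ1: Can off-policy RL with nonlinear function approximators converge when solving the Bellman equation?
RQ2: Do smoothing the Bellman operator and adopting a primal-dual formulation give stability and convergence with neural networks?
RQ3: How does the proposed SBEED framework compare with state-of-the-art baselines in sample efficiency and robustness on continuous control tasks?
RQ4: How does the smoothing parameter affect the bias-variance trade-off and the overall error in practice?
RQ5: Can the method handle both continuous and discrete action spaces under a single objective?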

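Key Findings

SBEED provides convergence guarantees for off-policy RL with general nonlinear function approximation.
The smoothed Bellman operator remains a contraction, which guarantees a unique fixed point V_λ* for λ > 0.
The tractable primal-dual formulation avoids the double-sampling problem and supports stochastic gradient updates.
The algorithm learns stably with neural networks and shows favorable empirical performance on continuous control benchmarks.
An explicit error decomposition separates smoothing bias, approximation error, and statistical error; the solution converges to V* as λ → 0 and the amount of data grows.
SBEED unifies value estimation and policy optimization, and supports multi-step bootstrapping and eligibility traces.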
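
The contraction property and the smoothing bias can be illustrated on a small tabular problem. The sketch below is an assumption-laden toy, not the paper's experimental setup: the MDP is random, the function names `smoothed_backup`, `hard_backup`, and `fixed_point` are hypothetical, and the smoothed operator is taken to be the entropy-regularized form λ · log Σ_a exp(Q(s, a) / λ). It iterates both the smoothed and the standard Bellman operator to their fixed points and reports how far V_λ* is from V* for several values of λ.

```python
import numpy as np

def smoothed_backup(V, R, P, gamma, lam):
    """One application of the smoothed operator: lam * logsumexp(Q / lam) per state."""
    Q = R + gamma * (P @ V)                  # Q[s, a] = R[s, a] + gamma * E[V(s')]
    m = Q.max(axis=1)                        # subtract the max for numerical stability
    return m + lam * np.log(np.exp((Q - m[:, None]) / lam).sum(axis=1))

def hard_backup(V, R, P, gamma):
    """One application of the standard Bellman optimality operator."""
    return (R + gamma * (P @ V)).max(axis=1)

def fixed_point(backup, n_states, iters=2000):
    """Iterate a contraction from zero until it is numerically at its fixed point."""
    V = np.zeros(n_states)
    for _ in range(iters):
        V = backup(V)
    return V

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9                      # toy sizes and discount (assumed)
R = rng.uniform(0.0, 1.0, size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states

V_star = fixed_point(lambda V: hard_backup(V, R, P, gamma), S)
gaps = {}
for lam in (1.0, 0.1, 0.01):
    V_lam = fixed_point(lambda V: smoothed_backup(V, R, P, gamma, lam), S)
    gaps[lam] = np.max(np.abs(V_lam - V_star))
    print(f"lambda={lam}: max |V_lam - V*| = {gaps[lam]:.4f}")
```

For this form of smoothing the gap is bounded by λ · log|A| / (1 − γ), so it shrinks as λ decreases. This covers only the smoothing-bias part of the error decomposition; approximation and statistical errors do not appear in a tabular, exact-model setting.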

This review was generated by AI and checked by a human editor.