QUICK REVIEW

[Paper Review] Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension

Ruosong Wang, Ruslan Salakhutdinov|arXiv (Cornell University)|May 21, 2020

Topic: Reinforcement Learning in Robotics | References: 67 | Cited by: 30

One-Sentence Summary

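The paper proposes a provably efficient Q-learning algorithm for reinforcement learning (RL) with general value function approximation. Its cumulative regret depends on the eluder dimension and the log-covering numbers of the function class, and it makes no explicit assumptions on the model of the environment. The algorithm uses a stable UCB-style exploration bonus together with a data subsampling scheme for efficiency.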

ABSTRACT

Value function approximation has demonstrated phenomenal empirical success in reinforcement learning (RL). Nevertheless, despite a handful of recent progress on developing theory for RL with linear function approximation, the understanding of general function approximation schemes largely remains missing. In this paper, we establish a provably efficient RL algorithm with general value function approximation. We show that if the value functions admit an approximation with a function class $\mathcal{F}$, our algorithm achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ where $d$ is a complexity measure of $\mathcal{F}$ that depends on the eluder dimension [Russo and Van Roy, 2013] and log-covering numbers, $H$ is the planning horizon, and $T$ is the number of interactions with the environment. Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment. Moreover, our algorithm is model-free and provides a framework to justify the effectiveness of algorithms used in practice.

Research Motivation and Goals

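Motivate and address reinforcement learning with general value function approximation, beyond the linear setting.
Develop a provably efficient, model-free Q-learning algorithm that applies to a general function class $\mathcal{F}$.
Characterize the algorithm's regret in terms of the eluder dimension of $\mathcal{F}$ and the covering numbers of $\mathcal{F}$ and of the state-action space.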

Proposed Method

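Define a general function class $\mathcal{F}$ for approximating the Q-function and assume closedness under Bellman backups: for any value function $V$, there exists $f_V \in \mathcal{F}$ such that $f_V(s,a) = r(s,a) + \sum_{s'} P(s' \mid s,a)\, V(s')$.
Compute $Q^k_h$ iteratively by least-squares fitting on the replay buffer, and add a stable UCB-type bonus $b^k_h$ to encourage exploration.
Use the width $w(\mathcal{F}^k_h, s, a)$ of a data-driven confidence region $\mathcal{F}^k_h$ as the bonus, which ensures that $Q^k_h$ is an optimistic (upper-bound) estimate with high probability.
Introduce stability through sensitivity sampling, an importance-sampling scheme that subsamples the dataset and controls the complexity of the bonus function.
Present Algorithm 1 (F-LSVI), which constructs the Q-values and the greedy policy, and Algorithm 3 (Bonus), which generates the stable bonus function.
Under Assumption 1, express the regret as a function of the eluder dimension $\dim_E(\mathcal{F}, \delta/T^3)$ and the covering numbers $\mathcal{N}(\mathcal{F}, \delta/T^2)$ and $\mathcal{N}(\mathcal{S} \times \mathcal{A}, \delta/T)$.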
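
The least-squares fit and the width-based bonus described above can be illustrated with a minimal sketch. It is not the paper's Algorithm 1 or Algorithm 3: it assumes a small finite candidate class whose predictions are given as arrays, and it omits the subsampling and stabilization steps. The names `fit_and_width`, `preds_on_data`, `preds_at_query`, and the radius `beta` are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_and_width(preds_on_data, targets, preds_at_query, beta):
    """Least-squares fit over a finite class plus a width-style bonus.

    preds_on_data:  (n_funcs, n_samples) values f_i(s_j, a_j) on the buffer.
    targets:        (n_samples,) regression targets, e.g. r_j + V(s'_j).
    preds_at_query: (n_funcs,) values f_i(s, a) at the query pair.
    beta:           squared radius of the confidence region (assumed given).
    """
    # Least-squares fit: pick the candidate with the smallest squared error.
    losses = ((preds_on_data - targets) ** 2).sum(axis=1)
    best = int(np.argmin(losses))
    # Confidence region: candidates that stay close to the fit on the data.
    dist_sq = ((preds_on_data - preds_on_data[best]) ** 2).sum(axis=1)
    in_region = dist_sq <= beta
    # Width: largest disagreement inside the region at the query pair.
    region_vals = preds_at_query[in_region]
    width = float(region_vals.max() - region_vals.min())
    return best, width

# Toy data: three candidate functions evaluated on two buffer samples.
preds_on_data = np.array([[1.0, 1.0], [1.1, 0.9], [3.0, 3.0]])
targets = np.array([1.0, 1.0])
preds_at_query = np.array([0.5, 2.0, 9.0])

best, width = fit_and_width(preds_on_data, targets, preds_at_query, beta=0.05)
optimistic_q = preds_at_query[best] + width  # fitted value plus bonus
print(best, width, optimistic_q)  # prints: 0 1.5 2.0
```

In this toy run the third candidate fits the data poorly and falls outside the region, so only the first two determine the width. The bonus is large at the query pair because those two candidates agree on the observed data but disagree at the query, which is the situation in which exploration is needed.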

Results

Research Questions

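RQ1: Can RL with general function approximation be provably efficient without model-based assumptions?
RQ2: How do the eluder dimension and covering numbers of the value function class govern the regret of a model-free RL algorithm with function approximation?
RQ3: Which practical mechanisms, such as a stable bonus and data subsampling, ensure efficient exploration and computation?
RQ4: How does the approach relate to, and generalize, existing results for linear and generalized linear function approximation?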

Key Findings

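The algorithm's regret bound grows as the square root of $H^2 T$ times a complexity factor that depends on $\dim_E(\mathcal{F}, \delta/T^3)$ and on logarithms of the covering numbers.
For tabular RL, the bound reduces to a form comparable to existing tabular results, although the naive bound is looser because of the method's generality.
When $\mathcal{F}$ is a $d$-dimensional linear or generalized linear class, $\dim_E(\mathcal{F}, \varepsilon) = O(d \log(1/\varepsilon))$ or a similar bound holds, so the regret grows with the corresponding complexity term and logarithmic factors.
The approach generalizes RL with linear function approximation and provides a model-free framework that does not assume explicit environment dynamics.
The stable bonus function and sensitivity sampling make the algorithm computationally feasible by controlling the dataset size while maintaining an upper confidence bound on Q.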
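
The subsampling idea behind the last finding can be sketched as follows. This is a simplified illustration for a finite candidate class, not the paper's exact procedure: the sensitivity formula, the regularizer `lam`, the `oversample` factor, and the function names are all assumptions made for the example. Each sample is kept with probability proportional to its sensitivity and reweighted by the inverse of that probability, so weighted squared disagreements stay unbiased.

```python
import numpy as np

def sensitivities(preds, lam=1e-3):
    """preds: (n_funcs, n_samples) candidate values on the dataset."""
    # Squared pairwise differences (f_i - f_j)^2 on every sample.
    diff_sq = (preds[:, None, :] - preds[None, :, :]) ** 2
    # Total squared disagreement of each pair over the data, plus lam.
    totals = diff_sq.sum(axis=2, keepdims=True) + lam
    # A sample's sensitivity: its largest share of any pair's total.
    return (diff_sq / totals).max(axis=(0, 1))

def sensitivity_subsample(preds, oversample, rng, lam=1e-3):
    """Return indices of kept samples and their importance weights."""
    keep_prob = np.minimum(1.0, oversample * sensitivities(preds, lam))
    kept = rng.random(len(keep_prob)) < keep_prob
    idx = np.flatnonzero(kept)
    # Inverse-probability weights; only kept samples are reweighted,
    # so samples with zero keep probability never cause a division.
    weights = 1.0 / keep_prob[idx]
    return idx, weights

# Toy data: two candidates that differ only on the last sample.
preds = np.array([[0.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
idx, weights = sensitivity_subsample(preds, oversample=2.0,
                                     rng=np.random.default_rng(0))
print(idx.tolist(), weights.tolist())  # prints: [3] [1.0]
```

Only the last sample can distinguish the two candidates, so it is the only one with nonzero sensitivity and the only one retained. The subsampled dataset shrinks from four points to one while preserving every pairwise disagreement that the confidence region depends on.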


This review was generated by AI and reviewed by human editors.