QUICK REVIEW

[Paper Review] Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity

Simon S. Du, Jason D. Lee | arXiv (Cornell University) | Feb 17, 2020

Advanced Bandit Algorithms Research | References: 34 | Cited by: 23

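One-Sentence Summary

This paper proposes a recursion-based Q-learning algorithm with function approximation that attains the optimal sample complexity for deterministic MDPs in the agnostic setting. It establishes tight bounds: when the approximation error δ is O(ρ/√dim_E), the algorithm finds the optimal policy using only O(dim_E) trajectories, which settles the open problem on agnostic Q-learning with function approximation posed by Wen and Van Roy (2013).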

ABSTRACT

The current paper studies the problem of agnostic $Q$-learning with function approximation in deterministic systems where the optimal $Q$-function is approximable by a function in the class $\mathcal{F}$ with approximation error $\delta \ge 0$. We propose a novel recursion-based algorithm and show that if $\delta = O\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O\left(\dim_E\right)$ trajectories, where $\rho$ is the gap between the optimal $Q$-value of the best actions and that of the second-best actions and $\dim_E$ is the Eluder dimension of $\mathcal{F}$. Our result has two implications: 1) In conjunction with the lower bound in [Du et al., ICLR 2020], our upper bound suggests that the condition $\delta = \widetilde{\Theta}\left(\rho/\sqrt{\dim_E}\right)$ is necessary and sufficient for algorithms with polynomial sample complexity. 2) In conjunction with the lower bound in [Wen and Van Roy, NIPS 2013], our upper bound suggests that the sample complexity $\widetilde{\Theta}\left(\dim_E\right)$ is tight even in the agnostic setting. Therefore, we settle the open problem on agnostic $Q$-learning proposed in [Wen and Van Roy, NIPS 2013]. We further extend our algorithm to the stochastic reward setting and obtain similar results.

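Research Motivation and Objectives

To resolve the open problem of whether Q-learning with function approximation can be provably efficient in deterministic MDPs in the agnostic setting.
To characterize necessary and sufficient conditions on the approximation error δ and the optimality gap ρ for polynomial sample complexity.
To establish matching upper and lower bounds on sample complexity, showing that Θ(dim_E) is optimal under the stated condition.
To extend the analysis to the stochastic reward setting while retaining similar guarantees.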

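Proposed Method

The algorithm is recursion-based: it incrementally builds a dataset Y of state-action-value pairs, using an oracle that selects actions according to uncertainty and approximation error.
It uses a maximum-uncertainty oracle to guide exploration, so that actions whose Q-value estimates have higher potential error are explored first.
The algorithm maintains the set Y of observed state-action-value pairs and estimates the Q-function f by least-squares regression over the function class F.
It defines the stopping condition of the exploration loop in terms of the deviation between estimated and true Q-values, ensuring convergence to a policy whose Q-values are within ρ/2 of optimal.
The analysis uses the Eluder dimension dim_E(F, ρ/4) as the measure of function-class complexity and ties it directly to sample complexity.
The guarantees are derived by induction over the levels of the MDP, proving that the estimated Q-function f is within ρ/2 of Q* at every state. Because the best and second-best actions differ by at least ρ, estimates this accurate rank them correctly, so the optimal policy can be recovered.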
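The recursive exploration described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm as published, and it rests on several assumptions that are not in the paper: the function class is linear with δ = 0 (exact realizability), the ridge-regression width sqrt(xᵀΛ⁻¹x) stands in for the maximum-uncertainty oracle, and the environment is a hypothetical binary-tree MDP whose rewards are constructed so that Q* is linear. All names (`Learner`, `solve`, `rollout`, `EPS`, `LAM`) are illustrative.

```python
import numpy as np

# Hypothetical toy setup (none of these names or values come from the paper).
H, D, ACTIONS = 4, 3, (0, 1)     # tree depth, feature dimension, actions
LAM, EPS = 1e-8, 1.0             # ridge regulariser, uncertainty threshold

rng = np.random.default_rng(0)
theta_star = rng.normal(size=D)  # Q*(s, a) = phi(s, a) . theta_star, so delta = 0
phi = {(h, i, a): rng.normal(size=D)
       for h in range(H) for i in range(2 ** h) for a in ACTIONS}

def q_star(h, i, a):
    return float(phi[h, i, a] @ theta_star)

def v_star(h, i):
    return 0.0 if h == H else max(q_star(h, i, a) for a in ACTIONS)

def step(h, i, a):
    """Deterministic transition; rewards are chosen so q_star satisfies Bellman."""
    nxt = (h + 1, 2 * i + a)
    return nxt, q_star(h, i, a) - v_star(*nxt)

class Learner:
    def __init__(self):
        self.cov = LAM * np.eye(D)  # LAM * I + sum of x x^T over the dataset Y
        self.b = np.zeros(D)        # sum of x * y over the dataset Y
        self.n_labels = 0           # number of pairs added to Y
        self.policy = {}

    def width(self, x):
        """Ridge uncertainty width, a stand-in for the max-uncertainty oracle."""
        return float(np.sqrt(x @ np.linalg.solve(self.cov, x)))

    def predict(self, x):
        """Least-squares (ridge) estimate of Q at feature x."""
        return float(x @ np.linalg.solve(self.cov, self.b))

    def add(self, x, y):
        self.cov += np.outer(x, x)
        self.b += x * y
        self.n_labels += 1

    def rollout(self, h, i, a):
        """Exact return of taking a, then following the recursively chosen actions."""
        (h2, i2), r = step(h, i, a)
        return r + self.solve(h2, i2)

    def solve(self, h, i):
        """Return the value of state (h, i), exploring only uncertain actions."""
        if h == H:
            return 0.0
        q, exact = {}, set()
        for a in ACTIONS:
            x = phi[h, i, a]
            if self.width(x) > EPS:       # uncertain: measure and store in Y
                q[a] = self.rollout(h, i, a)
                self.add(x, q[a])
                exact.add(a)
            else:                         # confident: trust the regression
                q[a] = self.predict(x)
        best = max(ACTIONS, key=q.get)
        self.policy[h, i] = best
        # The system is deterministic, so one rollout gives the exact value.
        return q[best] if best in exact else self.rollout(h, i, best)

learner = Learner()
value = learner.solve(0, 0)
print(f"root value {value:.4f}, optimal {v_star(0, 0):.4f}, "
      f"labelled pairs {learner.n_labels} of {len(phi)}")
```

In this sketch the dataset Y only grows when the width check fails, which mirrors the idea that the number of measured pairs is governed by the complexity of the function class rather than by the number of states. Handling δ > 0 would require widening the confidence threshold to account for misspecification, which this sketch does not attempt.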

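Results

Research Questions

RQ1: In deterministic systems, what is the largest approximation error δ under which agnostic Q-learning with function approximation still achieves polynomial sample complexity?
RQ2: Is the O(dim_E) sample complexity of agnostic Q-learning with function approximation in deterministic MDPs tight?
RQ3: Can a provably efficient algorithm be designed for the agnostic setting without assuming that the optimal Q-function is exactly linear?
RQ4: How do the optimality gap ρ, the approximation error δ, and the Eluder dimension jointly determine the sample complexity?
RQ5: Does the proposed algorithm remain sample-efficient under stochastic rewards?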

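Key Findings

When δ = O(ρ/√dim_E), the algorithm finds the optimal policy using only O(dim_E) trajectories, giving a tight sample complexity bound.
The condition δ = O(ρ/√dim_E) is both necessary and sufficient for polynomial sample complexity (up to logarithmic factors), matching the lower bound of Du et al. (ICLR 2020).
The sample complexity Θ(dim_E) is tight even in the agnostic setting (up to logarithmic factors), which settles the open problem posed by Wen and Van Roy (2013).
Under the assumption ρ ≥ 6√2 δ √dim_E(F, ρ/4), the algorithm achieves O(dim_E) sample complexity; the assumption ensures sufficient separation between optimal and suboptimal actions.
The analysis extends to the stochastic reward setting with similar sample complexity and approximation guarantees.
Using the Eluder dimension as the complexity measure allows the trade-off between approximation error and sample efficiency to be characterized precisely.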
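To make the gap condition ρ ≥ 6√2 δ √dim_E(F, ρ/4) concrete, the following small Python helper evaluates it for given numbers. It is only an arithmetic illustration: the function names are hypothetical, and the Eluder dimension is treated as a known input even though in the paper it depends on ρ through dim_E(F, ρ/4).

```python
import math

def gap_condition_holds(rho, delta, dim_e):
    """Check rho >= 6 * sqrt(2) * delta * sqrt(dim_e) for given numbers."""
    return rho >= 6 * math.sqrt(2) * delta * math.sqrt(dim_e)

def max_tolerable_delta(rho, dim_e):
    """Largest delta satisfying the condition for a given rho and dim_e."""
    return rho / (6 * math.sqrt(2) * math.sqrt(dim_e))

# Example with rho = 1 and dim_e = 50: 6 * sqrt(2) * sqrt(50) = 6 * sqrt(100) = 60,
# so the approximation error must be at most 1/60.
print(max_tolerable_delta(1.0, 50))
print(gap_condition_holds(1.0, 0.01, 50), gap_condition_holds(1.0, 0.02, 50))
```

The example shows how quickly the tolerable error shrinks: with a unit gap and an Eluder dimension of 50, an approximation error of 0.01 satisfies the condition while 0.02 does not.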


This review was generated by AI and reviewed by human editors.