QUICK REVIEW

[Paper Review] Learning Near Optimal Policies with Low Inherent Bellman Error

Andrea Zanette, Alessandro Lazaric|arXiv (Cornell University)|Feb 29, 2020

Topic: Advanced Bandit Algorithms Research | References: 47 | Cited by: 38

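One-Sentence Summary

This paper introduces Eleanor, an optimistic LSVI-based algorithm for episodic reinforcement learning with linear value function approximation. Under low inherent Bellman error, it proves a near-optimal regret bound that matches a lower bound, and when H=1 it recovers LinUCB with an exploration parameter that accounts for model misspecification.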

ABSTRACT

We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to show convergence of approximate value iteration. First we relate this condition to other common frameworks and show that it is strictly more general than the low rank (or linear) MDP assumption of prior work. Second we provide an algorithm with a high probability regret bound $\widetilde O(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t}\, \mathcal{I} K)$ where $H$ is the horizon, $K$ is the number of episodes, $\mathcal{I}$ is the value of the inherent Bellman error and $d_t$ is the feature dimension at timestep $t$. In addition, we show that the result is unimprovable beyond constants and logs by showing a matching lower bound. This has two important consequences: 1) it shows that exploration is possible using only batch assumptions with an algorithm that achieves the optimal statistical rate for the setting we consider, which is more general than prior work on low-rank MDPs; 2) the lack of closedness (measured by the inherent Bellman error) is only amplified by $\sqrt{d_t}$ despite working in the online setting. Finally, the algorithm reduces to the celebrated LinUCB when $H=1$ but with a different choice of the exploration parameter that allows handling misspecified contextual linear bandits. While computational tractability questions remain open for the MDP setting, this enriches the class of MDPs with a linear representation for the action-value function where statistically efficient reinforcement learning is possible.

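Motivation and Objectives

Study exploration with approximate linear action-value functions under low inherent Bellman error (IBE).
Clarify how IBE relates to the low-rank MDP and LSPI conditions, and show that it applies more broadly.
Develop an optimistic, globally optimized LSVI-style algorithm (Eleanor) that keeps the Q-function linear.
Establish information-theoretically tight regret guarantees and show what they imply for misspecified contextual linear bandits.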

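Proposed Method

Define the inherent Bellman error (IBE) for a linear Q-function class and relate it to the linear and low-rank MDP frameworks.
Extend LSVI to the optimistic setting by solving a planning optimization program that jointly selects the parameters theta_t and the optimistic perturbations across the whole horizon.
Introduce a globally optimized perturbation of each theta_t, subject to an ellipsoidal constraint in parameter space, which preserves linearity and yields tight confidence bounds.
Derive a regret bound of $\widetilde O(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t}\, \mathcal{I} K)$, where $\mathcal{I}$ is the inherent Bellman error.
Show how Eleanor reduces to LinUCB when H=1, with a modified exploration parameter to handle misspecification.
Discuss computational considerations and the connection to misspecified contextual linear bandits.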
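The H=1 reduction listed above can be illustrated with a LinUCB-style loop. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function names, the toy fixed-arm problem, and in particular the form of the exploration radius (a base term plus sqrt(k) times an assumed misspecification level `misspec`) are hypothetical choices made for illustration.

```python
import numpy as np


def select_arm(arm_features, sigma, b, radius):
    """Pick the arm with the largest optimistic value estimate.

    arm_features: (n_arms, d) array of feature vectors.
    sigma: (d, d) regularized Gram matrix.
    b: (d,) sum of reward-weighted features.
    radius: width of the confidence ellipsoid (exploration parameter).
    """
    sigma_inv = np.linalg.inv(sigma)
    theta_hat = sigma_inv @ b  # ridge regression estimate
    # Elliptical bonus sqrt(phi^T Sigma^{-1} phi) for every arm.
    bonus = np.sqrt(
        np.einsum("ij,jk,ik->i", arm_features, sigma_inv, arm_features)
    )
    return int(np.argmax(arm_features @ theta_hat + radius * bonus))


def exploration_radius(k, base_radius, misspec):
    """Assumed radius: a standard term plus sqrt(k) * misspec.

    The exact constants in the paper differ; this only shows the idea of
    widening the confidence set as misspecification accumulates.
    """
    return base_radius + np.sqrt(k) * misspec


def run_linucb(arm_features, mean_rewards, n_rounds, base_radius=1.0,
               misspec=0.0, reg=1.0, noise_std=0.1, seed=0):
    """Run the loop on a fixed arm set and return pull counts per arm."""
    rng = np.random.default_rng(seed)
    d = arm_features.shape[1]
    sigma = reg * np.eye(d)
    b = np.zeros(d)
    pulls = np.zeros(len(arm_features), dtype=int)
    for k in range(1, n_rounds + 1):
        radius = exploration_radius(k, base_radius, misspec)
        a = select_arm(arm_features, sigma, b, radius)
        reward = mean_rewards[a] + noise_std * rng.normal()
        phi = arm_features[a]
        sigma += np.outer(phi, phi)  # rank-one Gram update
        b += reward * phi
        pulls[a] += 1
    return pulls


if __name__ == "__main__":
    pulls = run_linucb(np.eye(2), np.array([1.0, 0.0]), n_rounds=200)
    print(pulls)
```

Setting `misspec=0.0` gives a plain LinUCB-style radius. A positive value keeps the bonus from shrinking as fast, which is the qualitative effect of the modified exploration parameter.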

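Experimental Results

Research Questions

RQ1: In online episodic reinforcement learning, can exploration be carried out efficiently with a linear Q-function class under low inherent Bellman error?
RQ2: How does the inherent Bellman error relate to, and go beyond, the low-rank MDP and LSPI conditions?
RQ3: What regret guarantees hold for an optimistic LSVI-type algorithm that preserves linearity and handles misspecification?
RQ4: Does the method recover known results (such as LinUCB) in special cases (such as H=1), and how does misspecification affect the bound?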

Key Findings

Eleanor achieves a regret bound of $\widetilde O(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t}\, \mathcal{I} K)$, where $\mathcal{I}$ is the inherent Bellman error and $\widetilde O$ hides polylogarithmic factors.
The inherent Bellman error framework is strictly more general than the low-rank MDP assumption and can handle misspecification with a $\sqrt{d_t}$ amplification of the IBE.
The result is unimprovable up to constants and logs, demonstrated via a matching lower bound for the setting without misspecification.
When H=1, Eleanor reduces to LinUCB with a modified exploration parameter to accommodate misspecification in contextual linear bandits.
The analysis extends to low-rank MDPs, improving prior bounds by a factor of the square root of the feature dimension, and provides a principled way to manage misspecification in online settings.
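To get a feel for how the two terms of the regret bound trade off, the sketch below evaluates them numerically with constants and logarithmic factors dropped. The function names and the example values (three timesteps with d_t = 4, IBE of 0.01) are illustrative assumptions, not figures from the paper.

```python
import math


def regret_bound_terms(dims, n_episodes, ibe):
    """Return (statistical_term, misspecification_term).

    dims: feature dimensions d_t, one per timestep t = 1..H.
    n_episodes: number of episodes K.
    ibe: inherent Bellman error.
    Constants and log factors hidden by the O-tilde are ignored.
    """
    statistical = sum(dims) * math.sqrt(n_episodes)
    misspecification = sum(math.sqrt(d) for d in dims) * ibe * n_episodes
    return statistical, misspecification


def crossover_episodes(dims, ibe):
    """Episode count K at which the two terms are equal (requires ibe > 0)."""
    ratio = sum(dims) / (ibe * sum(math.sqrt(d) for d in dims))
    return ratio ** 2


if __name__ == "__main__":
    stat, mis = regret_bound_terms([4, 4, 4], n_episodes=10_000, ibe=0.01)
    print(f"statistical term ~ {stat:.1f}, misspecification term ~ {mis:.1f}")
    print(f"terms balance near K = {crossover_episodes([4, 4, 4], 0.01):.0f}")
```

Because the first term grows like sqrt(K) and the second grows linearly in K, the misspecification term dominates once K is large enough, unless the inherent Bellman error is zero.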


This review was generated by AI and reviewed by a human editor.