QUICK REVIEW

[Paper Review] What are the Statistical Limits of Offline RL with Linear Function Approximation?

Ruosong Wang, Dean P. Foster|arXiv (Cornell University)|Oct 22, 2020

Reinforcement Learning in Robotics|References: 63|Cited by: 37

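One-Sentence Summary

The paper proves that, even when the Q-function of every policy is linear in the given features (realizability) and the offline data covers the feature space well, offline RL needs a number of samples exponential in the horizon to evaluate any given policy; it also explains why sample-efficient offline policy evaluation is impossible without stronger conditions.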

ABSTRACT

Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making problems. However, the extent to which this broader approach can be effective is not well understood, where the literature largely consists of sufficient conditions. This work focuses on the basic question of what are necessary representational and distributional conditions that permit provable sample-efficient offline reinforcement learning. Perhaps surprisingly, our main result shows that even if: 1) we have realizability in that the true value function of every policy is linear in a given set of features and 2) our off-policy data has good coverage over all features (under a strong spectral condition), then any algorithm still (information-theoretically) requires a number of offline samples that is exponential in the problem horizon in order to non-trivially estimate the value of any given policy. Our results highlight that sample-efficient offline policy evaluation is simply not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).

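Motivation and Objectives

Assess whether realizability and good feature coverage are sufficient for sample-efficient offline RL with linear function approximation.
Establish fundamental limits for offline policy evaluation in the linear setting.
Explain how errors are amplified, and identify the conditions under which sample efficiency is achievable.
Provide insight into the behavior of LSPE under offline data and realizability.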

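Proposed Approach

State and formalize a hardness theorem showing that, under Assumptions 1 (realizability) and 2 (feature coverage), the sample complexity is exponential in the horizon H.
Construct a hard MDP instance with linear Q-functions and bounded feature norms to prove the information-theoretic lower bound.
Analyze Least-Squares Policy Evaluation (LSPE) to illustrate geometric error amplification under offline data.
Introduce Assumption 3 (low distribution shift) and discuss why it suffices for sample efficiency when using LSPE.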
Discuss, from a sample-complexity perspective, how offline LSPE/LSVI relate to their online counterparts such as LSPI.
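
The backward least-squares recursion behind LSPE can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's implementation: the data layout (`datasets[h]` as a list of `(phi, reward, phi_next)` tuples, where `phi_next` is the feature vector of the next state paired with the target policy's action), the small `ridge` regularizer, and the helper names `solve`, `dot`, and `lspe` are all assumptions made for this example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lspe(datasets, d, ridge=1e-8):
    """Backward least-squares policy evaluation (illustrative sketch).

    datasets[h] holds (phi, reward, phi_next) tuples for level h;
    phi_next is None at the last level. Returns one weight vector per level.
    """
    H = len(datasets)
    theta_next = [0.0] * d
    thetas = [None] * H
    for h in reversed(range(H)):
        # Normal equations: (sum phi phi^T + ridge * I) theta = sum phi * y
        A = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
        b = [0.0] * d
        for phi, r, phi_next in datasets[h]:
            # Regression target: reward plus the estimated value at the next level.
            y = r + (dot(phi_next, theta_next) if phi_next is not None else 0.0)
            for i in range(d):
                b[i] += phi[i] * y
                for j in range(d):
                    A[i][j] += phi[i] * phi[j]
        theta_next = solve(A, b)
        thetas[h] = theta_next
    return thetas


# Hypothetical two-state example with one-hot features:
# state 0 moves to state 1, state 1 stays put; reward is 0 in state 0 and 1 in state 1.
e0, e1 = [1.0, 0.0], [0.0, 1.0]
inner = [(e0, 0.0, e1), (e1, 1.0, e1)]
last = [(e0, 0.0, None), (e1, 1.0, None)]
thetas = lspe([inner, inner, last], d=2)
print(thetas[0])  # close to [2.0, 3.0]
```

With one-hot features each regression is exact, so nothing is amplified here. The error amplification that the analysis above refers to concerns the general case, where the estimate at each level inherits the regression error passed back from the next level.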

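Results

Research Questions

RQ1: Does realizability plus good feature coverage guarantee polynomial sample complexity for offline policy evaluation with linear function approximation?
RQ2: Under what distributional or representational strengthening can offline RL become sample-efficient?
RQ3: Under the realizability assumption, how does error propagate (amplify) in offline policy evaluation with LSPE?
RQ4: Which concrete hard instances demonstrate the limits of offline RL with linear function approximation?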
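
The conditions named in RQ1 and RQ2 can be written out roughly as follows. The notation is an assumption of this sketch and may differ from the paper's own statements: \phi(s,a) \in \mathbb{R}^d is the feature map, \mu_h is the offline data distribution at level h, d_h^{\pi} is the state-action distribution of the target policy \pi at level h, and \lambda and C_h are constants. The third condition is one common way to formalize low distribution shift, not necessarily the exact form of the paper's Assumption 3.

```latex
% Realizability (RQ1): the Q-function of every policy is linear in the features.
\[
\forall \pi,\ \forall h \in \{1,\dots,H\}:\quad
\exists\, \theta_h^{\pi} \in \mathbb{R}^d \ \text{ such that } \
Q_h^{\pi}(s,a) = \langle \theta_h^{\pi}, \phi(s,a) \rangle
\quad \text{for all } (s,a).
\]

% Feature coverage (RQ1): the offline data excites every feature direction.
\[
\sigma_{\min}\!\left( \mathbb{E}_{(s,a)\sim\mu_h}\!\left[ \phi(s,a)\,\phi(s,a)^{\top} \right] \right) \ \ge\ \lambda \ >\ 0 .
\]

% Low distribution shift (RQ2), one possible formalization.
\[
\mathbb{E}_{(s,a)\sim d_h^{\pi}}\!\left[ \phi(s,a)\,\phi(s,a)^{\top} \right]
\ \preceq\ C_h\, \mathbb{E}_{(s,a)\sim\mu_h}\!\left[ \phi(s,a)\,\phi(s,a)^{\top} \right].
\]
```

When the features satisfy \|\phi(s,a)\|_2 \le 1, the trace of the covariance matrix is at most 1, so \lambda cannot exceed 1/d; coverage at that level is therefore the strongest spectral condition of this kind.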

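Key Findings

Under the stated assumptions, any algorithm needs a number of samples exponential in the horizon to non-trivially estimate the value of any given policy.
Although LSPE is an unbiased estimator on the hard construction, its variance still grows exponentially in the horizon H.
In the offline, linear, realizable setting, there is an exponential separation in sample complexity between offline LSVI and LSPI.
Two hard instances (sparse rewards and deterministic dynamics) show that errors in both reward estimation and transition estimation can be amplified.
Sample-efficient offline policy evaluation is possible only under stronger conditions, such as low distribution shift or representational assumptions stronger than realizability.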
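
A toy recursion helps show why geometric amplification leads to a dependence on the horizon that is exponential rather than polynomial. This is a hedged illustration, not the paper's construction: the function name `propagated_error`, the amplification factors 1.0 and 2.0, and the per-step error 0.01 are hypothetical values chosen for the example.

```python
def propagated_error(horizon, amplification, per_step_error):
    """Unroll err_h = amplification * err_{h+1} + per_step_error over `horizon` levels."""
    err = 0.0
    for _ in range(horizon):
        err = amplification * err + per_step_error
    return err


for H in (5, 10, 20):
    benign = propagated_error(H, 1.0, 0.01)     # no amplification: grows linearly in H
    amplified = propagated_error(H, 2.0, 0.01)  # factor 2 per level: grows like 2**H
    print(H, benign, amplified)
```

In this toy model, keeping the final error fixed when the factor exceeds 1 requires the per-step error to shrink geometrically with the horizon, which in turn calls for a sample size that grows exponentially in H.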

This review was generated by AI and reviewed by a human editor.