QUICK REVIEW

[Paper Review] On Reward-Free Reinforcement Learning with Linear Function Approximation

Ruosong Wang, Simon S. Du | arXiv (Cornell University) | Jun 19, 2020

Reinforcement Learning in Robotics | References: 36 | Cited by: 33

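One-Sentence Summary

The paper analyzes reward-free reinforcement learning with linear function approximation. It proves a polynomial upper bound under the linear MDP assumption and an exponential lower bound under the linear Q* assumption, which gives an explicit hardness separation between model-based and value-based assumptions.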

ABSTRACT

Reward-free reinforcement learning (RL) is a framework which is suitable for both the batch RL setting and the setting where there are many reward functions of interest. During the exploration phase, an agent collects samples without using a pre-specified reward function. After the exploration phase, a reward function is given, and the agent uses samples collected during the exploration phase to compute a near-optimal policy. Jin et al. [2020] showed that in the tabular setting, the agent only needs to collect polynomial number of samples (in terms of the number states, the number of actions, and the planning horizon) for reward-free RL. However, in practice, the number of states and actions can be large, and thus function approximation schemes are required for generalization. In this work, we give both positive and negative results for reward-free RL with linear function approximation. We give an algorithm for reward-free RL in the linear Markov decision process setting where both the transition and the reward admit linear representations. The sample complexity of our algorithm is polynomial in the feature dimension and the planning horizon, and is completely independent of the number of states and actions. We further give an exponential lower bound for reward-free RL in the setting where only the optimal $Q$-function admits a linear representation. Our results imply several interesting exponential separations on the sample complexity of reward-free RL.

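Motivation and Objectives

Investigate whether reward-free RL with linear function approximation can be provably efficient.
Characterize the sample complexity under two modeling assumptions: linear MDP and linear Q*.
Establish hardness results that clarify the limits of reward-free RL with function approximation.
Provide insight into the separation between model-based and value-based settings in reward-free RL.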

Proposed Method

Proposes a reward-free RL algorithm for linear MDPs that collects $\tilde{O}(d^3 H^6 / \varepsilon^2)$ trajectories during the exploration phase.
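During exploration, builds an exploration-driven reward from upper-confidence-bound (UCB) bonuses.
Uses least-squares value iteration (LSVI) to estimate Q-functions and derive optimistic value functions.
In the planning phase, runs batch RL on the collected dataset, relying on uniform coverage of the observed transitions and an optimistic Q-function.
Proves the optimistic planning guarantee via concentration arguments and the elliptical potential lemma.
Proves a lower bound for reward-free RL under the linear Q* assumption, showing that the sample complexity of the exploration phase is exponential.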
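
The UCB bonus, the exploration reward, and the LSVI regression step listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the ridge parameter `lam`, the bonus scale `beta`, the clipping of the bonus at `H`, and the toy data are all assumptions made here for the example.

```python
import numpy as np

def ridge_covariance(features, lam=1.0):
    """Lambda = lam * I + sum_i phi_i phi_i^T for features of shape (n, d)."""
    d = features.shape[1]
    return lam * np.eye(d) + features.T @ features

def ucb_bonus(phi, cov, beta, horizon):
    """u(phi) = min(beta * sqrt(phi^T Lambda^{-1} phi), H). The clip at H is an assumption."""
    width = np.sqrt(phi @ np.linalg.solve(cov, phi))
    return min(beta * width, horizon)

def exploration_reward(phi, cov, beta, horizon):
    """Exploration-driven reward r = u / H, which lies in [0, 1] after clipping."""
    return ucb_bonus(phi, cov, beta, horizon) / horizon

def lsvi_weights(features, targets, lam=1.0):
    """Ridge-regression step of LSVI: w = Lambda^{-1} sum_i phi_i * y_i."""
    cov = ridge_covariance(features, lam)
    return np.linalg.solve(cov, features.T @ targets)

# Hypothetical toy data: direction e1 is observed 99 times, direction e2 once.
features = np.array([[1.0, 0.0]] * 99 + [[0.0, 1.0]])
cov = ridge_covariance(features, lam=1.0)  # equals diag(100, 2)
H, beta = 5, 1.0

r_seen = exploration_reward(np.array([1.0, 0.0]), cov, beta, H)
r_rare = exploration_reward(np.array([0.0, 1.0]), cov, beta, H)

# Regression targets generated from a hypothetical true weight vector (2, 3).
targets = features @ np.array([2.0, 3.0])
w_hat = lsvi_weights(features, targets, lam=1.0)
```

In this toy setup the rarely observed direction receives the larger exploration reward, which is the mechanism that pushes the agent toward uncertain state-action pairs. The ridge term also shrinks the weight estimate more strongly in the direction with little data.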

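Experimental Results

Research Questions

RQ1: Under the linear MDP assumption, can reward-free RL be solved efficiently with linear function approximation?
RQ2: When only the optimal Q-function is linear (linear Q*), what is the sample complexity of reward-free RL, and how does it compare with the linear MDP setting?
RQ3: Do hardness results hold under weaker assumptions, and is there an exponential separation between the model-based and value-based settings?
RQ4: How does access to a simulator (generative model) affect the sample complexity of reward-free RL, compared with the standard RL interaction model?
RQ5: What conceptual separations arise between reward-free RL and standard RL under linear function approximation?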

Key Findings

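Under the linear MDP assumption, reward-free RL achieves polynomial exploration complexity: $\tilde{O}(d^3 H^6 / \varepsilon^2)$ trajectories suffice to produce, with high probability, an $\varepsilon$-optimal policy for any linear reward function given in the planning phase.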
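The algorithm constructs an exploration-driven reward function $r_h^k = u_h^k / H$ to incentivize visiting uncertain state-action pairs.
If only Q* is linear (linear Q*), any reward-free RL algorithm needs exponentially many samples in the exploration phase to guarantee near-optimal planning, even in deterministic MDPs.
Under the linear Q* assumption there is an exponential separation between reward-free RL and standard RL, because standard RL can be solved with polynomially many samples under suitable conditions.
With access to a simulator, a polynomial upper bound holds under linear Q*, which indicates an exponential separation between the simulator and non-simulator settings.
Taken together, the results show that under the same function approximation assumption, reward-free RL can be exponentially harder than standard RL, and that a simulator can substantially reduce the sample complexity.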
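
To get a feel for the polynomial trajectory bound in the first finding, the short sketch below evaluates only its leading term, $d^3 H^6 / \varepsilon^2$. The function name and the example values of `d`, `horizon`, and `eps` are assumptions chosen for illustration. The constants and logarithmic factors hidden by the $\tilde{O}$ notation are ignored, so the output shows scaling behavior and is not an actual sample count.

```python
def trajectory_budget(d, horizon, eps):
    """Leading term d^3 * H^6 / eps^2, ignoring constants and log factors."""
    return d**3 * horizon**6 / eps**2

# Hypothetical baseline: feature dimension 10, horizon 5, accuracy 0.1.
base = trajectory_budget(d=10, horizon=5, eps=0.1)

# Doubling the horizon multiplies the leading term by 2**6 = 64.
ratio_h = trajectory_budget(d=10, horizon=10, eps=0.1) / base

# Halving the target accuracy multiplies the leading term by 2**2 = 4.
ratio_eps = trajectory_budget(d=10, horizon=5, eps=0.05) / base

# Doubling the feature dimension multiplies the leading term by 2**3 = 8.
ratio_d = trajectory_budget(d=20, horizon=5, eps=0.1) / base
```

The number of states and actions does not appear in the function signature, which mirrors the abstract's claim that the bound is independent of both.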

This review was generated by AI and reviewed by a human editor.