QUICK REVIEW

[Paper Review] Provably Efficient Reinforcement Learning with Linear Function Approximation

Chi Jin, Zhuoran Yang|arXiv (Cornell University)|Jul 11, 2019
Advanced Bandit Algorithms Research|Cited by 219
One-Sentence Summary

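This paper presents the first provably efficient RL algorithm for the linear MDP setting with polynomial runtime and polynomial sample complexity, achieving a regret bound of Õ(√(d^3 H^3 T)) that is independent of the number of states and actions. It uses optimistic LSVI with a UCB bonus and remains robust under small model misspecification.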

ABSTRACT

Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)---a classical algorithm frequently studied in the linear setting---achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, such regret is independent of the number of states and actions.

Research Motivation and Goals

  • Motivate the design of provably efficient RL algorithms that use function approximation without relying on simulators or strong assumptions.
  • Study the linear MDP setting where transitions and rewards are linear in a feature map, and establish regret and sample complexity guarantees.
  • Develop and analyze an algorithm that attains sublinear regret independent of the state and action space sizes.

Proposed Method

  • Adopt an optimistic modification of Least-Squares Value Iteration (LSVI) with Upper-Confidence Bounds (UCB).
  • Represent Q_h as a linear function of features: Q_h(x,a)=w_h^T φ(x,a).
  • Update w_h via regularized least squares using observed rewards and next-value estimates.
  • Incorporate a UCB bonus β(φ^T Λ_h^{-1} φ)^{1/2} to encourage exploration, where Λ_h is the λ-regularized Gram matrix of the features observed at step h.
  • Prove that with appropriate λ and β, the total regret is Õ(√(d^3 H^3 T)) under Assumption A (linear MDP).
  • Show robustness to ζ-approximate linear MDPs with an additive regret term Õ(ζ d H T).
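
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the one-hot feature map `one_hot_phi`, the function name `lsvi_ucb_backward`, the layout of `history`, and the defaults `lam=1.0`, `beta=1.0` are all assumptions made for the example (the analysis chooses λ and β to obtain the stated bound). It performs one backward pass: a regularized least-squares fit of w_h, followed by an optimistic Q estimate truncated at H.

```python
import numpy as np

def one_hot_phi(x, a, n_actions=2, d=4):
    # Hypothetical feature map: tabular (state, action) pairs as one-hot vectors.
    f = np.zeros(d)
    f[x * n_actions + a] = 1.0
    return f

def lsvi_ucb_backward(history, phi, actions, H, d, lam=1.0, beta=1.0):
    """One backward pass of optimistic least-squares value iteration.

    history[h] holds (x, a, r, x_next) tuples seen at step h in earlier episodes.
    Returns a function q(h, x, a) giving the optimistic, truncated Q estimate.
    """
    w = [np.zeros(d) for _ in range(H)]
    gram_inv = [np.eye(d) / lam for _ in range(H)]

    def q(h, x, a):
        if h >= H:  # the value after the final step is zero
            return 0.0
        f = phi(x, a)
        bonus = beta * np.sqrt(f @ gram_inv[h] @ f)  # UCB exploration bonus
        return float(min(w[h] @ f + bonus, H))      # truncate at H

    for h in reversed(range(H)):
        gram = lam * np.eye(d)   # regularized Gram matrix for step h
        target = np.zeros(d)
        for x, a, r, x_next in history[h]:
            f = phi(x, a)
            gram += np.outer(f, f)
            # Regression target: reward plus optimistic next-step value.
            target += f * (r + max(q(h + 1, x_next, b) for b in actions))
        gram_inv[h] = np.linalg.inv(gram)
        w[h] = gram_inv[h] @ target  # regularized least-squares solution
    return q

# Toy usage: H = 2, one transition observed at step 0, none at step 1.
history = [[(0, 0, 1.0, 1)], []]
q = lsvi_ucb_backward(history, one_hot_phi, actions=[0, 1], H=2, d=4)
greedy = max([0, 1], key=lambda a: q(0, 0, a))
```

In this toy run the unvisited pair keeps the full bonus while the visited pair has a smaller bonus but a larger fitted value, so the greedy action at step 0 is the one already observed to pay a reward.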

Experimental Results

Research Questions

  • RQ1: Can we design RL algorithms with polynomial runtime and sample complexity when function approximation is used, without simulators or restrictive assumptions?
  • RQ2: Does a linear MDP structure suffice to guarantee sublinear regret independent of the size of the state and action spaces?
  • RQ3: How does misspecification (ζ-approximate linear MDP) affect regret and learning guarantees?

Key Findings

  • The proposed LSVI-UCB algorithm achieves regret Õ(√(d^3 H^3 T)) with high probability, independent of the number of states S and the number of actions A.
  • The algorithm runs in O(d^2 A K T) time and O(d^2 H + d A T) space, where K is the number of episodes; both costs are independent of S.
  • Under ζ-approximate linear MDPs, regret becomes Õ(√(d^3 H^3 T) + ζ d H T), i.e., misspecification contributes an additive term linear in T.
  • The results include a PAC-style guarantee: ε-optimal policy can be learned with Õ(d^3 H^4 / ε^2) samples when the initial state is fixed.
  • The method provides a bridge between tabular RL and function-approximation RL by achieving sublinear regret without simulators.
  • The analysis introduces a value-aware uniform concentration and a bridge between empirical and true transition measures using a linear structure.
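
To make the d^2 factor in the runtime concrete, here is a hedged sketch of how Λ^{-1} can be maintained in O(d^2) per new observation rather than re-inverted from scratch. It assumes a Sherman-Morrison rank-one update, which this summary does not itself describe; the function name `sherman_morrison_update` and the random features are illustrative only.

```python
import numpy as np

def sherman_morrison_update(a_inv, f):
    """Return (A + f f^T)^{-1} given A^{-1}, using O(d^2) operations.

    Assumes A is symmetric, which holds for a regularized Gram matrix.
    """
    af = a_inv @ f
    return a_inv - np.outer(af, af) / (1.0 + f @ af)

# Illustrative check against direct inversion on random features.
rng = np.random.default_rng(0)
d, lam = 5, 1.0
gram = lam * np.eye(d)
gram_inv = np.eye(d) / lam
for _ in range(20):
    f = rng.normal(size=d)
    gram += np.outer(f, f)                              # explicit Gram matrix
    gram_inv = sherman_morrison_update(gram_inv, f)     # incremental inverse

matches_direct = bool(np.allclose(gram_inv, np.linalg.inv(gram)))
```

Each update costs one matrix-vector product and one outer product, both quadratic in d, which is the design reason to prefer it over a cubic-cost inversion at every step.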


This review was generated by AI and reviewed by a human editor.