QUICK REVIEW

[Paper Review] Sample-Optimal Parametric Q-Learning Using Linearly Additive Features

Lin F. Yang, Mengdi Wang|arXiv (Cornell University)|Feb 13, 2019

Reinforcement Learning in Robotics|References: 32|Cited by: 32

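One-Sentence Summary

This paper proposes a sample-optimal parametric Q-learning algorithm for Markov decision processes (MDPs) with linearly additive state-action features; by exploiting monotonicity, variance reduction, and confidence bounds under an anchor state-action assumption, the method achieves a sample complexity of $\widetilde{O}(K/\epsilon^2(1-\gamma)^3)$, which matches the theoretical lower bound up to logarithmic factors and is therefore nearly sample-optimal for large-scale MDPs.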

ABSTRACT

Consider a Markov decision process (MDP) that admits a set of state-action features, which can linearly express the process's probabilistic transition model. We propose a parametric Q-learning algorithm that finds an approximate-optimal policy using a sample size proportional to the feature dimension $K$ and invariant with respect to the size of the state space. To further improve its sample efficiency, we exploit the monotonicity property and intrinsic noise structure of the Bellman operator, provided the existence of anchor state-actions that imply implicit non-negativity in the feature space. We augment the algorithm using techniques of variance reduction, monotonicity preservation, and confidence bounds. It is proved to find a policy which is $ε$-optimal from any initial state with high probability using $\widetilde{O}(K/ε^2(1-γ)^3)$ sample transitions for arbitrarily large-scale MDP with a discount factor $γ\in(0,1)$. A matching information-theoretical lower bound is proved, confirming the sample optimality of the proposed method with respect to all parameters (up to polylog factors).
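The feature-based model in the abstract can be made concrete with a small numerical sketch. It assumes the transition kernel factors as $P(s' \mid s,a) = \sum_k \phi_k(s,a)\,\psi_k(s')$; the names `phi`, `psi`, the problem sizes, and the choice of Dirichlet-distributed features are illustrative assumptions, not taken from the paper. The point it shows is that a Bellman backup only needs the $K$-dimensional vector `w = psi @ V`, not the full transition tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, K = 50, 4, 3          # hypothetical sizes: states, actions, features
gamma = 0.9

# phi[s, a] is a K-dimensional feature vector; here each one lies on the
# probability simplex (an illustrative choice that keeps P a valid kernel).
phi = rng.dirichlet(np.ones(K), size=(S, A))   # shape (S, A, K)
# psi[k] is a probability distribution over next states.
psi = rng.dirichlet(np.ones(S), size=K)        # shape (K, S)

# Full transition model expressed linearly through the features.
P = phi @ psi                                  # shape (S, A, S)

r = rng.random((S, A))      # arbitrary rewards in [0, 1]
V = rng.random(S)           # arbitrary value function

# Backup with the full model: touches all S next states per (s, a).
Q_full = r + gamma * (P @ V)

# Backup through the features: V enters only via K numbers.
w = psi @ V                                    # shape (K,)
Q_param = r + gamma * (phi @ w)
```

Both backups agree up to floating-point error, which is why the unknown quantity to learn has dimension $K$ rather than the number of states.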

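Research Motivation and Objectives

Address the curse of dimensionality in large-scale MDPs through structured feature representations.
Determine the minimum number of sampled state transitions needed to learn an $\epsilon$-optimal policy with high probability.
Develop a provably sample-efficient Q-learning algorithm whose sample size scales only with the feature dimension $K$, not with the size of the state space.
Establish a tight information-theoretic lower bound that matches the algorithm's performance up to polylogarithmic factors.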

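Proposed Method

Adopts a parametric Q-learning framework that updates parameters from sampled transitions, avoiding function fitting.
Uses variance reduction and confidence bounds to keep tight error control during value iteration.
Exploits the monotonicity of the Bellman operator together with the anchor state-action assumption to guarantee policy improvement.
Applies recursive confidence-region updates with mini-batch sampling to accelerate convergence.
Uses a layered parameter-update scheme with exponentially decreasing error bounds to ensure monotone improvement.
Relies on a novel analysis that combines the law of total expectation for Markov chains with concentration inequalities to bound the estimation error.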
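The basic parameter-update loop can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it omits variance reduction, monotonicity preservation, and confidence bounds, and it assumes the anchor state-actions have unit-vector features, so that `phi[s, a]` directly gives the weights expressing `(s, a)` in terms of the `K` anchors. The function name `parametric_q_sketch`, the `sample_next` generative-model interface, and the toy problem are all hypothetical.

```python
import numpy as np

def parametric_q_sketch(phi, r, sample_next, gamma, iters, batch, rng):
    """Simplified parametric Q-learning sketch (hypothetical interface).

    phi[s, a]              : weights of (s, a) over the K anchors
    r[s, a]                : known rewards in [0, 1]
    sample_next(k, n, rng) : n next states drawn from anchor k
    """
    K = phi.shape[-1]
    w = np.zeros(K)  # w[k] estimates E[V(s') | anchor k]
    for _ in range(iters):
        w_new = np.empty(K)
        for k in range(K):
            nxt = sample_next(k, batch, rng)
            # Q_w(s', a) = r(s', a) + gamma * phi(s', a)^T w,
            # evaluated only at the sampled next states.
            q_nxt = r[nxt] + gamma * (phi[nxt] @ w)
            w_new[k] = q_nxt.max(axis=1).mean()
        w = w_new
    return w

# Toy problem (sizes and distributions are illustrative assumptions).
rng = np.random.default_rng(0)
S, A, K, gamma = 30, 3, 4, 0.9
phi = rng.dirichlet(np.ones(K), size=(S, A))   # (S, A, K)
psi = rng.dirichlet(np.ones(S), size=K)        # (K, S), anchor k -> next state
r = rng.random((S, A))

def sample_next(k, n, rng):
    # Generative model: draw next states from anchor k's distribution.
    return rng.choice(S, size=n, p=psi[k])

w = parametric_q_sketch(phi, r, sample_next, gamma,
                        iters=200, batch=2000, rng=rng)
Q_est = r + gamma * (phi @ w)

# Reference: exact value iteration using the full model.
P = phi @ psi
Q_ref = np.zeros((S, A))
for _ in range(500):
    Q_ref = r + gamma * (P @ Q_ref.max(axis=1))
max_err = np.abs(Q_est - Q_ref).max()
```

Each iteration draws `K * batch` transitions regardless of the number of states, which reflects the summary's point that sampling cost is tied to the feature dimension. On this toy problem `max_err` gives a rough check of how close the sampled estimate lands to exact value iteration.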

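Experimental Results

Research Questions

RQ1: What is the information-theoretic lower bound on the number of samples needed to learn an $\epsilon$-optimal policy in a feature-based MDP?
RQ2: Can a Q-learning algorithm achieve a sample complexity that is independent of the state-space size and depends only on the feature dimension $K$?
RQ3: How can monotonicity and variance reduction be used to improve sample efficiency in parametric Q-learning?
RQ4: Does a provably sample-optimal algorithm exist for parametric Q-learning under a linearly additive feature model?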

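Key Findings

The proposed algorithm achieves a sample complexity of $\widetilde{O}(K/\epsilon^2(1-\gamma)^3)$, which matches the information-theoretic lower bound up to logarithmic factors.
With probability at least $1-\delta$, the algorithm finds an $\epsilon$-optimal policy from any initial state using $\widetilde{O}(K/\epsilon^2(1-\gamma)^3 \cdot \log(1/\delta))$ samples.
At $\gamma = 0.99$, the accelerated algorithm is $10^8$ times faster than the basic parametric Q-learning baseline.
The method is the first to achieve sample optimality (up to polylogarithmic factors) in MDPs with a linear transition model.
The anchor state-action assumption preserves non-negativity in the feature space, which is essential for monotone policy improvement and tight error control.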
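To get a feel for these rates, the leading term can be evaluated numerically. The sketch below drops all constants and polylogarithmic factors, so it shows scaling only, not actual sample counts. The helper `leading_term` and its `horizon_power` parameter are hypothetical; the value 7 used for the baseline is an assumption chosen because four extra powers of $1/(1-\gamma)$ reproduce a $10^8$ gap at $\gamma = 0.99$, and this summary does not state the baseline's exponent.

```python
import math

def leading_term(K, eps, gamma, horizon_power=3, delta=None):
    """K / (eps^2 * (1 - gamma)^horizon_power), optionally times log(1/delta).

    Constants and polylog factors are omitted, so only ratios and
    scaling trends are meaningful.
    """
    term = K / (eps ** 2 * (1.0 - gamma) ** horizon_power)
    if delta is not None:
        term *= math.log(1.0 / delta)
    return term

# Scaling at a moderate discount factor: about 1e6, with no dependence
# on the number of states.
base = leading_term(K=10, eps=0.1, gamma=0.9)

# Ratio between an assumed (1 - gamma)^-7 baseline and the (1 - gamma)^-3
# rate at gamma = 0.99: (1 / 0.01)^4, i.e. about 1e8.
gap = (leading_term(10, 0.1, 0.99, horizon_power=7)
       / leading_term(10, 0.1, 0.99, horizon_power=3))
```

Doubling `K` doubles the term, while halving `eps` quadruples it; the state-space size never appears.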

This review was generated by AI and checked by human editors.