QUICK REVIEW

[Paper Review] Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

Dongruo Zhou, Quanquan Gu|arXiv (Cornell University)|Dec 15, 2020

Advanced Bandit Algorithms Research|References: 65|Cited by: 23

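One-Sentence Summary

This paper proposes UCRL-VTR⁺ and UCLK⁺, two computationally efficient reinforcement learning algorithms for linear mixture Markov decision processes, both built on a new Bernstein-type concentration inequality for self-normalized martingales. The algorithms attain nearly minimax optimal regret: $\tilde{O}(dH\sqrt{T})$ in the episodic undiscounted setting and $\tilde{O}(d\sqrt{T}/(1-\gamma)^{1.5})$ in the discounted setting, each matching the corresponding lower bound up to logarithmic factors.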

ABSTRACT

We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that $\text{UCRL-VTR}^{+}$ attains an $\tilde O(dH\sqrt{T})$ regret where $d$ is the dimension of feature mapping, $H$ is the length of the episode and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for this setting, which shows that $\text{UCRL-VTR}^{+}$ is minimax optimal up to logarithmic factors. In addition, we propose the $\text{UCLK}^{+}$ algorithm for the same family of MDPs under discounting and show that it attains an $\tilde O(d\sqrt{T}/(1-\gamma)^{1.5})$ regret, where $\gamma\in [0,1)$ is the discount factor. Our upper bound matches the lower bound $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ proved by Zhou et al. (2020) up to logarithmic factors, suggesting that $\text{UCLK}^{+}$ is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
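
For readers new to the setting, the linear mixture model named in the abstract is usually written as below. The symbols $\phi$, $\theta^{*}$ and $\phi_V$ are not defined anywhere on this page; they follow the common convention in this line of work and should be read as assumptions. The sketch also uses a single parameter vector for all stages, which is a simplification.

```latex
% Linear mixture MDP: the transition kernel is linear in a known feature map
% \phi and an unknown parameter vector \theta^{*}.
\[
  \mathbb{P}(s' \mid s, a) = \bigl\langle \phi(s' \mid s, a),\, \theta^{*} \bigr\rangle,
  \qquad \theta^{*} \in \mathbb{R}^{d}.
\]
% For a bounded value function V, an integration oracle returns the
% aggregated feature \phi_V, which makes the expected next-state value
% linear in \theta^{*}.
\[
  \phi_{V}(s, a) = \sum_{s'} \phi(s' \mid s, a)\, V(s'),
  \qquad
  [\mathbb{P} V](s, a) = \bigl\langle \phi_{V}(s, a),\, \theta^{*} \bigr\rangle.
\]
```

Under this form, estimating the transition model reduces to a linear regression problem with features $\phi_V(s,a)$ and targets $V(s')$, which is why concentration results from linear bandits carry over.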

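Research Motivation and Objectives

Close the gap between the upper and lower regret bounds for reinforcement learning with linear function approximation in large MDPs.
Develop a computationally efficient algorithm that attains nearly minimax optimal regret in the episodic undiscounted setting.
Extend the approach to the discounted setting and derive a regret bound that matches the known lower bound up to logarithmic factors.
Establish a new Bernstein-type concentration inequality for vector-valued martingales that improves on existing self-normalized bounds.
Show that the proposed algorithms attain nearly minimax optimal regret for linear mixture MDPs, assuming access to an integration or sampling oracle.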

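Proposed Method

Proposes a new Bernstein-type self-normalized concentration inequality for vector-valued martingales, improving the noise dependence from $R\sqrt{d}$ to $\sigma\sqrt{d}+R$, where $R$ bounds the noise magnitude and $\sigma^2$ bounds its conditional variance.
Applies the new inequality to design UCRL-VTR⁺ for episodic undiscounted MDPs, replacing Hoeffding-type bounds with tighter Bernstein-type confidence sets.
Designs UCLK⁺ for discounted MDPs by adapting the same inequality to the UCLK framework while preserving computational efficiency.
Uses the integration or sampling oracle to compute confidence sets and policy updates efficiently in linear mixture MDPs.
Decomposes the regret into estimation-error and optimization-error terms, and bounds them using the new concentration inequality and self-normalized martingale techniques.
Uses the new inequality to build high-probability confidence sets for the true parameter vector, which enables a tighter regret analysis.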
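
A Bernstein-type confidence set only helps if the estimator takes per-sample variance into account. A standard way to do that is variance-weighted ridge regression, sketched below in plain Python. The list above does not spell out the estimator, so treat this as an illustration under that assumption rather than the authors' implementation; the function names `weighted_ridge` and `solve_linear_system` and the toy data are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def solve_linear_system(a: Matrix, b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is not modified.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def weighted_ridge(
    features: Sequence[Vector],
    targets: Sequence[float],
    variances: Sequence[float],
    lam: float = 1.0,
) -> Vector:
    """Minimise lam * ||theta||^2 + sum_t (<x_t, theta> - y_t)^2 / var_t."""
    d = len(features[0])
    # Regularised Gram matrix starts at lam * I, so it is always invertible.
    gram = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    rhs = [0.0] * d
    for x, y, var in zip(features, targets, variances):
        w = 1.0 / var  # low-variance samples receive a larger weight
        for i in range(d):
            rhs[i] += w * x[i] * y
            for j in range(d):
                gram[i][j] += w * x[i] * x[j]
    return solve_linear_system(gram, rhs)


if __name__ == "__main__":
    # Toy data (hypothetical): two orthogonal features, two observations.
    xs = [[1.0, 0.0], [0.0, 1.0]]
    ys = [2.0, 3.0]
    print(weighted_ridge(xs, ys, variances=[1.0, 1.0]))
    print(weighted_ridge(xs, ys, variances=[0.5, 1.0]))
```

In the linear mixture setting, each feature vector would play the role of an aggregated feature of the current value estimate and each target the observed next-state value; how the variances are estimated is left out of this sketch.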

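Experimental Results

Research Questions

RQ1: Is there a computationally efficient reinforcement learning algorithm that attains nearly minimax optimal regret in linear mixture MDPs?
RQ2: In linear bandit and reinforcement learning settings, does a Bernstein-type concentration inequality for vector-valued martingales improve regret bounds over Hoeffding-type bounds?
RQ3: Is the regret of UCRL-VTR⁺ in the episodic undiscounted setting optimal up to logarithmic factors?
RQ4: Can the approach be extended to the discounted setting with a matching regret bound?
RQ5: Do the proposed algorithms match the known minimax lower bounds up to logarithmic factors?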

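Key Findings

UCRL-VTR⁺ attains a regret bound of $\tilde{O}(dH\sqrt{T})$ in the episodic undiscounted setting, matching the $\Omega(dH\sqrt{T})$ lower bound proved in the paper up to logarithmic factors.
UCLK⁺ attains a regret bound of $\tilde{O}(d\sqrt{T}/(1-\gamma)^{1.5})$ in the discounted setting, matching the $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ lower bound of Zhou et al. (2020) up to logarithmic factors.
The proposed Bernstein-type concentration inequality improves the noise dependence from $R\sqrt{d}$ to $\sigma\sqrt{d}+R$, giving tighter confidence bounds for linear function approximation.
The algorithms are computationally efficient, assuming access to an integration or sampling oracle for the basis kernels.
The regret analysis shows that the leading term of the bound grows as $\sqrt{T}$, consistent with near-optimal sample efficiency.
According to the authors, these are the first computationally efficient, nearly minimax optimal reinforcement learning algorithms with linear function approximation, here in the linear mixture MDP setting.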
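
To get a feel for the improvement from $R\sqrt{d}$ to $\sigma\sqrt{d}+R$, the short sketch below compares the two leading-order scales. It is an order-of-magnitude illustration only: constants and logarithmic factors are dropped, and the values chosen for `R`, `sigma` and `d` are hypothetical.

```python
import math


def hoeffding_scale(noise_bound: float, dim: int) -> float:
    """Leading-order radius R * sqrt(d) of a Hoeffding-type confidence set."""
    return noise_bound * math.sqrt(dim)


def bernstein_scale(noise_std: float, noise_bound: float, dim: int) -> float:
    """Leading-order radius sigma * sqrt(d) + R of a Bernstein-type set."""
    return noise_std * math.sqrt(dim) + noise_bound


if __name__ == "__main__":
    R, sigma = 1.0, 0.1  # hypothetical: noise bounded by 1, small variance
    for d in (4, 100, 10_000):
        h = hoeffding_scale(R, d)
        b = bernstein_scale(sigma, R, d)
        print(f"d={d}: Hoeffding-type {h:.2f}, Bernstein-type {b:.2f}")
```

The gap widens as the dimension grows and as the variance shrinks relative to the worst-case noise bound. When $\sigma$ is close to $R$ the two scales are comparable.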

This review was generated by AI and reviewed by human editors.