QUICK REVIEW

[Paper Review] Model-Based Reinforcement Learning with Value-Targeted Regression

Alex Ayoub, Zeyu Jia|arXiv (Cornell University)|Jun 1, 2020

Advanced Bandit Algorithms Research|References: 44|Cited by: 71

One-Sentence Summary

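This paper proposes UCRL-VTR, a model-based RL algorithm that uses value-targeted regression to build confidence sets and then plans optimistically over them. Its regret bound scales with the complexity of the model class rather than with the size of the state or action space, and it includes a bound for linear mixture models.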

ABSTRACT

This paper studies model-based reinforcement learning (RL) for regret minimization. We focus on finite-horizon episodic RL where the transition model $P$ belongs to a known family of models $\mathcal{P}$, a special case of which is when models in $\mathcal{P}$ take the form of linear mixtures: $P_{\theta} = \sum_{i=1}^{d} \theta_{i}P_{i}$. We propose a model-based RL algorithm that is based on the optimism principle: in each episode, the set of models that are "consistent" with the data collected is constructed. The criterion of consistency is based on the total squared error that the model incurs on the task of predicting values as determined by the last value estimate along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We derive a bound on the regret, which, in the special case of linear mixtures, takes the form $\tilde{\mathcal{O}}(d\sqrt{H^{3}T})$, where $H$, $T$ and $d$ are the horizon, total number of steps and dimension of $\theta$, respectively. In particular, this regret bound is independent of the total number of states or actions, and is close to a lower bound $\Omega(\sqrt{HdT})$. For a general model family $\mathcal{P}$, the regret bound is derived using the notion of the so-called Eluder dimension proposed by Russo & Van Roy (2014).
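
The consistency criterion described in words above can be sketched as a formula. The notation is assumed here for illustration and is not quoted from the abstract: $(s_h^{k'}, a_h^{k'}, s_{h+1}^{k'})$ is the transition observed at step $h$ of episode $k'$, $V_{h+1,k'}$ is the value estimate in use at that time, and the state space is taken to be countable so that expectations are sums. Under these assumptions, the total squared value-prediction error of a candidate model $P$ after $k$ episodes is

$$\sum_{k'=1}^{k}\sum_{h=1}^{H}\Big(\sum_{s'}P(s'\mid s_h^{k'},a_h^{k'})\,V_{h+1,k'}(s')-V_{h+1,k'}(s_{h+1}^{k'})\Big)^{2}.$$

For a linear mixture the prediction is linear in the parameter, since $\sum_{s'}P_{\theta}(s'\mid s,a)\,V(s')=\sum_{i=1}^{d}\theta_{i}\sum_{s'}P_{i}(s'\mid s,a)\,V(s')$, so minimizing this error over $\theta$ is an ordinary least-squares problem in $d$ unknowns.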

Motivation and Objectives

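Minimize regret in online model-based RL when the transition model is known to belong to a family $\mathcal{P}$.
Propose value-targeted regression to build data-consistent confidence sets over $\mathcal{P}$.
Develop an optimistic-planning algorithm (UCRL-VTR) that uses these confidence sets.
Provide theoretical regret bounds and evaluate the method empirically.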

Proposed Method

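Define a finite-horizon episodic MDP with a known model family $\mathcal{P}$, and consider linear mixture models $P_{\theta} = \sum_{j=1}^{d} \theta_{j}P_{j}$.
Introduce value-targeted regression, which builds a regression loss $L_{k+1}(P, \hat{P}_{k+1})$ from the predicted values $V_{h+1,k}$ and the observed targets $y_{h,k}$.
Build confidence sets $B_k$ from the regression loss: $B_{k+1} = \{P' \in \mathcal{P} : L_{k+1}(P', \hat{P}_{k+1}) \le \beta_{k+1}\}$.
In each episode $k$, plan optimistically over $B_k$: choose $P_k \in \arg\max_{P' \in B_k} V^{*}_{P',1}(s_1^k)$, execute the optimal policy for $P_k$, and update the value targets.
Give regret bounds parameterized by the Eluder dimension and covering numbers; specializing to linear mixtures yields an upper bound of $\tilde{\mathcal{O}}(d\sqrt{H^{3}T})$ and a lower bound of $\Omega(\sqrt{HdT})$, where $T$ is the total number of steps.
Discuss implementation considerations and the connection to MuZero.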
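
The regression and confidence-set steps above can be sketched for a linear mixture with $d = 2$. This is a minimal illustration under stated assumptions, not the paper's implementation: the base models `P1` and `P2`, the parameter `THETA_TRUE`, the stand-in value vectors in `VALUES`, the ridge weight `lam`, and the radius `beta` are all hypothetical choices made for this example. The sketch also uses the ridge-regularized Gram matrix in the confidence check, and it omits the optimistic planning step entirely.

```python
import random

# Two hypothetical base models over 2 states and 2 actions.
# P[s][a] is the next-state distribution [Pr(s'=0), Pr(s'=1)].
P1 = [[[0.9, 0.1], [0.6, 0.4]], [[0.3, 0.7], [0.8, 0.2]]]
P2 = [[[0.2, 0.8], [0.5, 0.5]], [[0.7, 0.3], [0.1, 0.9]]]
THETA_TRUE = [0.7, 0.3]  # mixture weights, unknown to the learner
# Stand-ins for the value estimates V_{h+1,k}; the algorithm would compute these.
VALUES = [[1.0, 0.0], [0.0, 1.0], [0.5, 1.0]]


def features(s, a, V):
    """x_j = <P_j(.|s,a), V>: expected next-state value under base model j."""
    return [sum(p * v for p, v in zip(P[s][a], V)) for P in (P1, P2)]


def collect(n, rng):
    """Sample n (feature, target) pairs; the target is y = V(s')."""
    data = []
    for _ in range(n):
        s, a, V = rng.randrange(2), rng.randrange(2), rng.choice(VALUES)
        # Probability of landing in state 0 under the true mixture model.
        p0 = THETA_TRUE[0] * P1[s][a][0] + THETA_TRUE[1] * P2[s][a][0]
        s_next = 0 if rng.random() < p0 else 1
        data.append((features(s, a, V), V[s_next]))
    return data


def fit(data, lam=1.0):
    """Ridge regression of value targets on features (2x2 solve, Cramer's rule)."""
    m00 = m11 = lam
    m01 = b0 = b1 = 0.0
    for (x0, x1), y in data:
        m00 += x0 * x0
        m01 += x0 * x1
        m11 += x1 * x1
        b0 += x0 * y
        b1 += x1 * y
    det = m00 * m11 - m01 * m01
    theta_hat = [(b0 * m11 - m01 * b1) / det, (m00 * b1 - m01 * b0) / det]
    return theta_hat, (m00, m01, m11)


def in_confidence_set(theta, theta_hat, gram, beta):
    """Check (theta - theta_hat)^T M (theta - theta_hat) <= beta."""
    m00, m01, m11 = gram
    d0, d1 = theta[0] - theta_hat[0], theta[1] - theta_hat[1]
    return m00 * d0 * d0 + 2 * m01 * d0 * d1 + m11 * d1 * d1 <= beta


rng = random.Random(0)
theta_hat, gram = fit(collect(20000, rng))
print(theta_hat)  # estimate of the mixture weights
# beta=10.0 is an illustrative radius, not the paper's beta_k.
print(in_confidence_set(THETA_TRUE, theta_hat, gram, beta=10.0))
```

The design point the sketch tries to show is that the regression never estimates next-state probabilities directly. It only fits the scalar quantity the planner needs, the expected next-state value, which is why the number of unknowns is $d$ rather than the number of states.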

Experimental Results

Research Questions

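RQ1: Can model-based RL with value-targeted regression achieve sublinear regret over a general model family $\mathcal{P}$?
RQ2: How does the regret bound depend on the complexity of $\mathcal{P}$ (e.g., its Eluder dimension) and on noise and non-stationarity in the value targets?
RQ3: What are the advantages and limitations of optimistic planning over value-targeted confidence sets compared with traditional model-based methods?
RQ4: How does the regret scaling specialize to linear mixture models?
RQ5: How does the method compare empirically with other model-based RL methods and with variants of value-targeted regression?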

Key Findings

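For linear mixture models, the algorithm achieves a regret bound of $\tilde{\mathcal{O}}(d\sqrt{H^{3}T})$.
In the general model-class setting, regret is bounded in terms of the Eluder dimension of a function class defined by the value targets.
The upper bound is independent of the size of the state and action spaces and, in the linear case, is close to the lower bound $\Omega(\sqrt{HdT})$.
Value-targeted regression focuses model learning on task-relevant dynamics and may be more efficient than likelihood-based model fitting.
Experiments indicate that value-targeted regression with optimistic planning is effective, and that removing either optimism or value-targeted regression degrades performance.
The work is related to MuZero, which independently uses value-targeted regression to build its model.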
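
To make "close to the lower bound" concrete, the two linear-mixture rates quoted above can be divided, ignoring constants and logarithmic factors:

$$\frac{d\sqrt{H^{3}T}}{\sqrt{HdT}} = d\sqrt{\frac{H^{3}T}{HdT}} = d\cdot\frac{H}{\sqrt{d}} = H\sqrt{d}.$$

The gap therefore depends only on the horizon and the parameter dimension, not on $T$ or on the number of states or actions. As an illustration with assumed values $H = 10$ and $d = 16$, the upper bound exceeds the lower bound by a factor of about $10 \cdot 4 = 40$.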


This review was generated by AI and reviewed by human editors.