QUICK REVIEW

[Paper review] Bayesian decision-making under misspecified priors with applications to meta-learning

Max Simchowitz, Christopher Tosh|arXiv (Cornell University)|Dec 6, 2021

Advanced Bandit Algorithms Research | Cited by 2

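One-sentence summary

This paper proves that the performance of Thompson sampling degrades gracefully under prior misspecification: its expected reward differs from that of the well-specified case by at most $\tilde{\mathcal{O}}(H^2 \epsilon)$, where $\epsilon$ is the total-variation distance between the priors and $H$ is the learning horizon. The analysis applies broadly to Bayesian decision-making, including meta-learning and POMDPs, and for priors with bounded support it yields a nonparametric bound that is independent of the size of the action space and tight up to universal constants in the worst case.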

ABSTRACT

Thompson sampling and other Bayesian sequential decision-making algorithms are among the most popular approaches to tackle explore/exploit trade-offs in (contextual) bandits. The choice of prior in these algorithms offers flexibility to encode domain knowledge but can also lead to poor performance when misspecified. In this paper, we demonstrate that performance degrades gracefully with misspecification. We prove that the expected reward accrued by Thompson sampling (TS) with a misspecified prior differs by at most $\tilde{\mathcal{O}}(H^2 \epsilon)$ from TS with a well specified prior, where $\epsilon$ is the total-variation distance between priors and $H$ is the learning horizon. Our bound does not require the prior to have any parametric form. For priors with bounded support, our bound is independent of the cardinality or structure of the action space, and we show that it is tight up to universal constants in the worst case. Building on our sensitivity analysis, we establish generic PAC guarantees for algorithms in the recently studied Bayesian meta-learning setting and derive corollaries for various families of priors. Our results generalize along two axes: (1) they apply to a broader family of Bayesian decision-making algorithms, including a Monte-Carlo implementation of the knowledge gradient algorithm (KG), and (2) they apply to Bayesian POMDPs, the most general Bayesian decision-making setting, encompassing contextual bandits as a special case. Through numerical simulations, we illustrate how prior misspecification and the deployment of one-step look-ahead (as in KG) can impact the convergence of meta-learning in multi-armed and contextual bandits with structured and correlated priors.

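Research motivation and goals

Understand how prior misspecification affects Thompson sampling and other Bayesian sequential decision-making algorithms.
Establish general sensitivity bounds for Bayesian decision-making under prior misspecification that do not depend on the structure of the action space.
Extend these bounds to the Bayesian meta-learning setting and derive PAC guarantees for various families of priors.
Generalize the results beyond contextual bandits to the broader class of Bayesian POMDPs.
Illustrate through simulations the practical impact of prior misspecification and one-step look-ahead in meta-learning.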

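Proposed method

Derive a non-asymptotic bound, in terms of the total-variation distance $\epsilon$ and the horizon $H$, on the difference in expected reward between Thompson sampling with a misspecified prior and Thompson sampling with the correct prior.
Take a nonparametric approach that does not assume the prior has any particular parametric form.
Prove that, for priors with bounded support, the bound is tight up to universal constants in the worst case.
Extend the sensitivity analysis to a broader family of Bayesian decision-making algorithms, including a Monte-Carlo implementation of the knowledge gradient (KG) algorithm.
Apply the results to derive generic PAC guarantees in the Bayesian meta-learning setting.
Use numerical simulations to evaluate the convergence of meta-learning under structured and correlated priors, and the effect of one-step look-ahead.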
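
The comparison in the first item above can be sketched in code. This is a minimal illustration, not the authors' implementation or experimental setup: it assumes a Bernoulli bandit whose prior has finite support over a few candidate mean-reward vectors, so that the total-variation distance is a simple sum. The names `tv_distance`, `thompson_sampling`, `bayesian_reward`, and all numeric values are hypothetical choices made for this example.

```python
import numpy as np

def tv_distance(p, q):
    """Total-variation distance between two distributions on the same finite set."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def thompson_sampling(prior, instances, true_means, horizon, rng):
    """Run Thompson sampling for `horizon` steps; return the total reward.

    prior:      agent's prior weights over candidate instances (may be wrong);
                assumed strictly positive
    instances:  array (num_instances, num_arms) of candidate mean rewards,
                assumed strictly inside (0, 1)
    true_means: mean-reward vector of the environment actually played
    """
    log_post = np.log(np.asarray(prior, dtype=float))
    total = 0.0
    for _ in range(horizon):
        # Normalize the log-posterior in a numerically stable way.
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        idx = rng.choice(len(instances), p=post)  # sample an instance from the posterior
        arm = int(np.argmax(instances[idx]))      # play the best arm for that sample
        reward = float(rng.random() < true_means[arm])
        total += reward
        # Bayes update with the Bernoulli likelihood of the observed reward.
        mu = instances[:, arm]
        log_post = log_post + np.log(mu if reward == 1.0 else 1.0 - mu)
    return total

def bayesian_reward(agent_prior, true_prior, instances, horizon, runs, seed=0):
    """Average total reward when environments are drawn from `true_prior`."""
    rng = np.random.default_rng(seed)
    totals = []
    for _ in range(runs):
        env = rng.choice(len(instances), p=true_prior)
        totals.append(
            thompson_sampling(agent_prior, instances, instances[env], horizon, rng)
        )
    return float(np.mean(totals))

# Hypothetical example: three candidate two-armed instances.
instances = np.array([[0.8, 0.2], [0.2, 0.8], [0.5, 0.6]])
true_prior = np.array([0.5, 0.3, 0.2])
wrong_prior = np.array([0.4, 0.3, 0.3])
horizon = 20

eps = tv_distance(true_prior, wrong_prior)
r_true = bayesian_reward(true_prior, true_prior, instances, horizon, runs=200)
r_wrong = bayesian_reward(wrong_prior, true_prior, instances, horizon, runs=200)
print(f"TV distance: {eps:.3f}")
print(f"Reward gap (Monte-Carlo estimate): {abs(r_true - r_wrong):.3f}")
```

The printed gap is a noisy Monte-Carlo estimate, so it is only a rough empirical counterpart to the quantity that the sensitivity bound controls.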

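Experimental results

Research questions

RQ1: How does prior misspecification affect the expected reward of Thompson sampling, and can the resulting degradation be bounded?
RQ2: Can the sensitivity of Bayesian decision-making to prior misspecification be quantified without assuming a parametric form for the prior?
RQ3: How does prior misspecification affect meta-learning algorithms in the Bayesian setting?
RQ4: How do the derived bounds extend beyond contextual bandits to general Bayesian POMDPs?
RQ5: To what extent does one-step look-ahead (as in KG) mitigate or amplify the effect of prior misspecification in meta-learning?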

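Key findings

The expected reward of Thompson sampling with a misspecified prior differs from that of Thompson sampling with the correct prior by at most $\tilde{\mathcal{O}}(H^2 \epsilon)$, where $\epsilon$ is the total-variation distance between the priors and $H$ is the learning horizon.
When the prior has bounded support, the bound is independent of the cardinality or structure of the action space.
For priors with bounded support, the bound is tight up to universal constants in the worst case.
The sensitivity analysis extends to a broader family of Bayesian decision-making algorithms, including a Monte-Carlo implementation of the knowledge gradient algorithm.
The results yield generic PAC guarantees for Bayesian meta-learning, with corollaries for various families of priors.
Numerical simulations illustrate how prior misspecification and one-step look-ahead can affect the convergence of meta-learning in multi-armed and contextual bandits with structured and correlated priors.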
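
To make the scale of the first finding concrete, the bound can be written out as follows. The notation is an assumption introduced here for illustration and is not taken from the paper: $P$ denotes the true prior, $P'$ the misspecified prior, and $R_H(\cdot)$ the expected cumulative reward over $H$ steps when environments are drawn from $P$.

```latex
\[
\bigl| R_H(\mathrm{TS}(P)) - R_H(\mathrm{TS}(P')) \bigr|
  \le \tilde{\mathcal{O}}(H^2 \epsilon),
\qquad \epsilon = \mathrm{TV}(P, P').
\]
```

As a rough worked example, suppose per-step rewards lie in $[0, 1]$ (an assumption made here for the arithmetic), so that the total reward is at most $H$. With $H = 100$ and $\epsilon = 10^{-3}$, the quantity $H^2 \epsilon$ equals $10$, against a maximum total reward of $100$. Ignoring constants and logarithmic factors, the bound is informative only when $\epsilon$ is well below $1/H$.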

This review was generated by AI and reviewed by human editors.