QUICK REVIEW

[Paper Review] Thompson Sampling for Contextual Bandits with Linear Payoffs

Shipra Agrawal, Navin Goyal|arXiv (Cornell University)|Sep 15, 2012

Advanced Bandit Algorithms Research | References: 30 | Cited by: 547

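One-Sentence Summary

This paper proposes and analyzes a generalization of Thompson Sampling for the stochastic contextual multi-armed bandit problem with linear payoff functions, using a Gaussian prior and a Gaussian likelihood to balance exploration and exploitation. It establishes the first high-probability regret bound for this setting, $\tilde{O}(d^{3/2}\sqrt{T})$, which is the best bound known for any computationally efficient algorithm and is within a factor of $\sqrt{d}$ of the information-theoretic lower bound.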

ABSTRACT

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We provide the first theoretical guarantees for the contextual version of Thompson Sampling. We prove a high probability regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$ (or $\tilde{O}(d\sqrt{T \log(N)})$), which is the best regret bound achieved by any computationally efficient algorithm available for this problem in the current literature, and is within a factor of $\sqrt{d}$ (or $\sqrt{\log(N)}$) of the information-theoretic lower bound for this problem.

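Research Motivation and Objectives

Provide the first theoretical regret guarantees for Thompson Sampling in the stochastic contextual multi-armed bandit problem with linear payoff functions.
Close the gap between Thompson Sampling's strong empirical performance and the lack of theoretical understanding of its behavior in the contextual setting.
Establish a high-probability regret bound that is within a factor of $\sqrt{d}$ of the information-theoretic lower bound for the problem.
Develop a new martingale-based analysis technique that is simpler and more extensible than earlier approaches.
Extend Thompson Sampling from the standard multi-armed bandit setting to the more complex contextual setting with a linear model.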

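Proposed Method

The algorithm places a Gaussian prior on the unknown parameter $\mu \in \mathbb{R}^d$ and uses a Gaussian likelihood for the reward of an arm given its context vector $b_i$.
In each round $t$, the algorithm samples a parameter $\tilde{\mu}(t)$ from the posterior distribution and plays the arm with the highest sampled expected reward $b_i^T \tilde{\mu}(t)$.
The analysis relies on a novel martingale-based concentration argument to control how far the parameter estimate deviates from the true $\mu$.
Key lemmas establish concentration and anti-concentration properties of Gaussian random variables, which are used to bound the estimation error and the regret.
Using $\ell_2$-norm concentration inequalities, the regret is decomposed into terms involving the posterior variance and the context vectors, and each term is bounded.
The final regret bound follows by combining these bounds with high-probability concentration inequalities such as Azuma-Hoeffding.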
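The sample-then-maximize loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' reference implementation: it assumes an identity prior precision, a standard Bayesian linear-regression posterior $N(\hat{\mu}, v^2 B^{-1})$, and a synthetic environment with unit-norm contexts. The names `LinearThompsonSampling`, `simulate`, and the scale parameter `v` are hypothetical; the summary does not say how the posterior scale should be set.

```python
import numpy as np


class LinearThompsonSampling:
    """Sketch of Thompson Sampling with a Gaussian posterior over mu in R^d."""

    def __init__(self, d, v=1.0, seed=0):
        self.B = np.eye(d)    # posterior precision; identity prior is an assumption
        self.f = np.zeros(d)  # running sum of reward-weighted contexts
        self.v = v            # hypothetical posterior scale, not taken from the paper
        self.rng = np.random.default_rng(seed)

    def select_arm(self, contexts):
        """contexts: array of shape (N, d), one context vector b_i per arm."""
        mu_hat = np.linalg.solve(self.B, self.f)  # posterior mean
        # Draw mu_tilde ~ N(mu_hat, v^2 B^{-1}) using the Cholesky factor of B:
        # if B = L L^T and z ~ N(0, I), then L^{-T} z has covariance B^{-1}.
        L = np.linalg.cholesky(self.B)
        z = self.rng.standard_normal(len(mu_hat))
        mu_tilde = mu_hat + self.v * np.linalg.solve(L.T, z)
        # Play the arm with the highest sampled expected reward b_i^T mu_tilde.
        return int(np.argmax(contexts @ mu_tilde))

    def update(self, context, reward):
        """Rank-one posterior update after observing the played arm's reward."""
        self.B += np.outer(context, context)
        self.f += reward * context


def simulate(T=2000, d=5, n_arms=10, noise=0.1, seed=0):
    """Run the sketch on a synthetic linear bandit and return cumulative regret."""
    rng = np.random.default_rng(seed)
    mu = rng.standard_normal(d)
    mu /= np.linalg.norm(mu)  # true parameter, unknown to the agent
    agent = LinearThompsonSampling(d, v=noise, seed=seed + 1)
    regret = 0.0
    for _ in range(T):
        contexts = rng.standard_normal((n_arms, d))
        contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)
        arm = agent.select_arm(contexts)
        means = contexts @ mu  # true expected reward of each arm
        reward = means[arm] + noise * rng.standard_normal()
        agent.update(contexts[arm], reward)
        regret += means.max() - means[arm]
    return regret, agent, mu
```

Calling `simulate()` returns the cumulative regret against the best arm in each round, the trained agent, and the true parameter, so the growth of regret with `T` can be inspected empirically. The contexts here are drawn at random for convenience, which is a milder setting than the adaptive adversary considered in the paper.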

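Experimental Results

Research Questions

RQ1: Can Thompson Sampling achieve provably low regret in the contextual multi-armed bandit problem with linear payoff functions?
RQ2: What is the tightest high-probability regret bound achievable by Thompson Sampling in this setting?
RQ3: How does the regret of Thompson Sampling compare with the information-theoretic lower bound and with other state-of-the-art algorithms?
RQ4: Can Bayesian algorithms such as Thompson Sampling match the theoretical guarantees of frequentist algorithms such as UCB?
RQ5: Does the analysis technique extend to non-Gaussian priors or other model classes?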

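Key Findings

The paper establishes a high-probability regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$ for Thompson Sampling in the contextual linear bandit setting.
At the time of publication, this was the best regret bound achieved by any computationally efficient algorithm for this problem.
The bound is within a factor of $\sqrt{d}$ of the information-theoretic lower bound, so it is near-optimal.
An alternative regret bound of $\tilde{O}(d\sqrt{T\log N})$, which depends on the number of arms $N$, is also derived.
The analysis is robust: it holds even when the actual reward distribution is not Gaussian, provided the likelihood and prior satisfy the required concentration properties.
The martingale-based analysis technique is shown to be simpler and easier to extend than earlier approaches.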
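The two gap factors quoted in the abstract can be checked with simple arithmetic. Dividing each upper bound by its stated factor gives the same quantity, so both statements are consistent with a lower bound of order $d\sqrt{T}$. The explicit form of the lower bound is inferred here from those factors rather than quoted from the paper, and the $\tilde{O}$ notation hides logarithmic factors.

$$
\frac{d^{3/2}\sqrt{T}}{\sqrt{d}} = d\sqrt{T},
\qquad
\frac{d\sqrt{T\log N}}{\sqrt{\log N}} = d\sqrt{T}.
$$

Comparing the two upper bounds directly, $d\sqrt{T\log N} \le d^{3/2}\sqrt{T}$ exactly when $\log N \le d$, so the $N$-dependent bound is the tighter one when the number of arms is at most exponential in the dimension.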

This review was generated by AI and reviewed by a human editor.