QUICK REVIEW

[Paper Explained] Further Optimal Regret Bounds for Thompson Sampling

Shipra Agrawal, Navin Goyal|arXiv (Cornell University)|Sep 15, 2012

Advanced Bandit Algorithms Research | References: 24 | Citations: 303

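One-Sentence Summary

This paper presents a new martingale-based regret analysis of Thompson Sampling that establishes the optimal problem-dependent regret bound $(1+\epsilon)\sum_i \frac{\ln T}{\Delta_i} + O(\frac{N}{\epsilon^2})$ and the first near-optimal problem-independent regret bound $O(\sqrt{NT\ln T})$, resolving a COLT 2012 open problem. The analysis is conceptually simple, extends to non-Beta distributions and to contextual bandits, and gives tighter theoretical guarantees than prior work.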

ABSTRACT

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state of the art methods. In this paper, we provide a novel regret analysis for Thompson Sampling that simultaneously proves both the optimal problem-dependent bound of $(1+\epsilon)\sum_i \frac{\ln T}{\Delta_i}+O(\frac{N}{\epsilon^2})$ and the first near-optimal problem-independent bound of $O(\sqrt{NT\ln T})$ on the expected regret of this algorithm. Our near-optimal problem-independent bound solves a COLT 2012 open problem of Chapelle and Li. The optimal problem-dependent regret bound for this problem was first proven recently by Kaufmann et al. [ALT 2012]. Our novel martingale-based analysis techniques are conceptually simple, easily extend to distributions other than the Beta distribution, and also extend to the more general contextual bandits setting [Manuscript, Agrawal and Goyal, 2012].
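For readers who want to see the algorithm the abstract refers to, here is a minimal sketch of Thompson Sampling for Bernoulli rewards. It assumes the standard Beta-Bernoulli form with a uniform Beta(1, 1) prior, which this summary does not spell out; the function name, parameters, and the simulated reward model are illustrative choices, not taken from the paper.

```python
import random

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Simulate Beta-Bernoulli Thompson Sampling.

    Returns (pulls per arm, cumulative expected regret).
    `true_means` are hypothetical arm means used only by the simulator.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms  # number of observed rewards equal to 1
    failures = [0] * n_arms   # number of observed rewards equal to 0
    best_mean = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one sample per arm from its Beta posterior (assumed uniform prior).
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        # Play the arm whose posterior sample is largest.
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        # Accumulate the gap of the played arm (zero for the best arm).
        regret += best_mean - true_means[arm]
    pulls = [successes[i] + failures[i] for i in range(n_arms)]
    return pulls, regret
```

Calling `thompson_sampling_bernoulli([0.9, 0.1], 2000)` returns the per-arm pull counts and the accumulated regret. The regret bounds in the abstract concern how the expectation of that second quantity grows with the horizon $T$.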

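Research Motivation and Goals

Provide a tight regret analysis of Thompson Sampling that yields both the optimal problem-dependent regret bound and a near-optimal problem-independent regret bound.
Resolve the open problem posed by Chapelle and Li at COLT 2012 on a near-optimal problem-independent regret bound for Thompson Sampling.
Develop a conceptually simpler, martingale-based analysis technique that extends to exponential-family distributions other than the Beta distribution.
Generalize the analysis to the more general contextual bandits setting, showing its broader applicability.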

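Proposed Method

Develops a novel martingale-based analysis framework for bounding the expected number of times a suboptimal arm is played.
Introduces intermediate thresholds $x_i$ and $y_i$ to relate the regret to the KL divergence $d(\mu_i, \mu_1)$.
Uses large-deviation inequalities and tail bounds based on moment generating functions to control the expected number of plays of each suboptimal arm.
Uses Pinsker's inequality, which for Bernoulli means gives $d(\mu_i, \mu_1) \geq 2\Delta_i^2$, to convert KL-based bounds into bounds in terms of the gaps $\Delta_i$.
For the problem-independent bound, applies the problem-dependent bound to arms with gap $\Delta_i \geq \sqrt{N\ln T / T}$; arms with smaller gaps contribute at most $\sqrt{NT\ln T}$ to the regret in total.
Shows, through structural properties of the posterior update, that the framework extends to non-Beta distributions and to the contextual bandits setting.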
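The Pinsker step and the gap-threshold step in the list above can be written out as a short worked derivation. This is a sketch under stated assumptions, not the paper's proof: rewards are taken to be Bernoulli, $\Delta_i = \mu_1 - \mu_i$, the per-arm count bound is simplified to $\mathbb{E}[k_i(T)] \le C \ln T / \Delta_i^2$ with lower-order terms dropped, and both the constant $C$ and the threshold symbol $\Delta^\ast$ are introduced here for illustration.

```latex
% Step 1: Pinsker's inequality for Bernoulli means.
d(\mu_i, \mu_1) \;\ge\; 2(\mu_1 - \mu_i)^2 \;=\; 2\Delta_i^2
\quad\Longrightarrow\quad
\frac{\ln T}{d(\mu_i, \mu_1)} \;\le\; \frac{\ln T}{2\Delta_i^2}.

% Step 2: split the arms at the threshold \Delta^\ast = \sqrt{N \ln T / T}.
\mathbb{E}[R(T)]
  \;=\; \sum_{i} \Delta_i \, \mathbb{E}[k_i(T)]
  \;\le\; \underbrace{\Delta^\ast \, T}_{\text{arms with } \Delta_i < \Delta^\ast}
  \;+\; \underbrace{\sum_{i:\,\Delta_i \ge \Delta^\ast} \frac{C \ln T}{\Delta_i}}_{\text{arms with } \Delta_i \ge \Delta^\ast}

% Step 3: substitute the threshold into both terms.
\Delta^\ast \, T \;=\; \sqrt{N T \ln T},
\qquad
\sum_{i:\,\Delta_i \ge \Delta^\ast} \frac{C \ln T}{\Delta_i}
  \;\le\; \frac{C N \ln T}{\Delta^\ast}
  \;=\; C \sqrt{N T \ln T}.
```

Both terms are of order $\sqrt{NT\ln T}$, which is why the threshold is chosen at $\sqrt{N\ln T / T}$: it balances the regret of the small-gap arms against that of the large-gap arms.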

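Results

Research Questions

RQ1: Can Thompson Sampling achieve the optimal problem-dependent regret bound, as first proven by Kaufmann et al. [ALT 2012]?
RQ2: Can Thompson Sampling achieve a near-optimal problem-independent regret bound of $O(\sqrt{NT\ln T})$, thereby resolving the COLT 2012 open problem?
RQ3: Is there a conceptually simpler and more general regret analysis of Thompson Sampling that goes beyond the Beta-Bernoulli setting?
RQ4: Can the analysis be carried over to the contextual bandits setting with minimal modification?
RQ5: Within the posterior probability matching framework, how do the bounds depend on the choice of the thresholds $x_i$ and $y_i$?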

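Main Findings

The paper establishes the optimal problem-dependent regret bound $(1+\epsilon)\sum_i \frac{\ln T}{\Delta_i} + O(\frac{N}{\epsilon^2})$ for Thompson Sampling, within a factor of $1+\epsilon$ of the asymptotic lower bound.
It proves the first near-optimal problem-independent regret bound $O(\sqrt{NT\ln T})$, resolving the COLT 2012 open problem.
The analysis rests on a conceptually simpler martingale-based framework that avoids the intricate information-theoretic decompositions of earlier approaches.
By relying on KL divergence and concentration properties, the method extends naturally to distributions other than the Beta distribution, such as normal and other exponential-family distributions.
The framework can be adapted to the contextual bandits setting, and its core ideas informed the analysis in subsequent work on contextual bandits.
It derives the expected number of plays of each suboptimal arm, $\mathbb{E}[k_i(T)] = O(\frac{1}{\Delta_i^2} \ln T)$, which, combined with the worst-case scaling of $\Delta_i$, yields the overall $O(\sqrt{NT\ln T})$ regret.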
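To get a feel for how the two bounds in these findings compare, the following sketch evaluates their leading terms numerically. It is only an illustration: the constants hidden by the $O(\cdot)$ notation and the additive $O(N/\epsilon^2)$ term are omitted, and the function names and example inputs are hypothetical.

```python
import math

def problem_dependent_leading_term(gaps, horizon, eps):
    """Evaluate (1 + eps) * sum_i ln(T) / Delta_i over the suboptimal arms.

    `gaps` lists the gaps Delta_i of the suboptimal arms only.
    The additive O(N / eps^2) term is deliberately left out.
    """
    return (1 + eps) * sum(math.log(horizon) / gap for gap in gaps)

def problem_independent_rate(n_arms, horizon):
    """Evaluate sqrt(N * T * ln T), omitting the constant hidden by O(.)."""
    return math.sqrt(n_arms * horizon * math.log(horizon))

if __name__ == "__main__":
    horizon = 10_000
    # Hypothetical instances with one optimal arm and one suboptimal arm.
    for gap in (0.2, 0.001):
        dependent = problem_dependent_leading_term([gap], horizon, eps=0.1)
        independent = problem_independent_rate(2, horizon)
        print(f"gap={gap}: dependent={dependent:.1f}, independent={independent:.1f}")
```

With a large gap the problem-dependent term is the smaller of the two. As the gap shrinks it grows like $1/\Delta_i$ and eventually exceeds the problem-independent rate, which is the regime where a gap-free bound is informative.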

This explainer was generated by AI and reviewed by a human editor.