QUICK REVIEW

[Paper Review] Thompson Sampling for Combinatorial Semi-Bandits

Siwei Wang, Wei Chen|arXiv (Cornell University)|Mar 13, 2018

Advanced Bandit Algorithms Research | References: 27 | Cited by: 28

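One-Sentence Summary

This paper proposes Combinatorial Thompson Sampling (CTS) for stochastic combinatorial multi-armed bandits (CMAB) and matroid bandits, combining Thompson sampling with new analysis techniques to obtain tighter regret bounds. It establishes a distribution-dependent regret upper bound of $O(m\log K_{\max}\log T/\Delta_{\min})$, which improves on prior UCB-based methods, and it matches the regret lower bound in the matroid setting.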

ABSTRACT

In this paper, we study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We first analyze the standard TS algorithm for the general CMAB model when the outcome distributions of all the base arms are independent, and obtain a distribution-dependent regret bound of $O(m\log K_{\max}\log T / \Delta_{\min})$, where $m$ is the number of base arms, $K_{\max}$ is the size of the largest super arm, $T$ is the time horizon, and $\Delta_{\min}$ is the minimum gap between the expected reward of the optimal solution and any non-optimal solution. This regret upper bound is better than the $O(m(\log K_{\max})^2\log T / \Delta_{\min})$ bound in prior works. Moreover, our novel analysis techniques can help to tighten the regret bounds of other existing UCB-based policies (e.g., ESCB), as we improve the method of counting the cumulative regret. Then we consider the matroid bandit setting (a special class of CMAB model), where we could remove the independence assumption across arms and achieve a regret upper bound that matches the lower bound. Except for the regret upper bounds, we also point out that one cannot directly replace the exact offline oracle (which takes the parameters of an offline problem instance as input and outputs the exact best action under this instance) with an approximation oracle in TS algorithm for even the classical MAB problem. Finally, we use some experiments to show the comparison between regrets of TS and other existing algorithms, the experimental results show that TS outperforms existing baselines.
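
As a rough illustration of the improvement stated in the abstract, the two order-wise bounds can be divided term by term. This is only a sketch: the constants hidden by the $O(\cdot)$ notation are ignored, so the ratio describes scaling rather than actual regret values.

$$
\frac{m(\log K_{\max})^2\log T/\Delta_{\min}}{m\log K_{\max}\log T/\Delta_{\min}} = \log K_{\max}
$$

Under that reading, the gain grows with the size of the largest super arm. For example, assuming natural logarithms and $K_{\max}=100$, the factor is $\ln 100 \approx 4.6$; for $K_{\max}=1$ (the classical MAB case) the $\log K_{\max}$ terms degenerate and the comparison is not meaningful.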

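Motivation and Objectives

Develop and analyze Thompson sampling for the general combinatorial multi-armed bandit (CMAB) framework with independent arm distributions.
Establish tighter regret bounds for CTS than for existing UCB-based policies such as ESCB and CUCB.
Extend the analysis to the matroid bandit setting, where CTS achieves a regret bound that matches the lower bound.
Examine the limitations of replacing the exact offline oracle with an approximation oracle in Thompson sampling.
Validate experimentally that CTS outperforms state-of-the-art algorithms on CMAB and matroid bandit problems.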

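Proposed Method

Apply Thompson sampling to CMAB by sampling parameters from the posterior distribution and selecting a super arm based on those samples.
Use Bayes' rule to update the posterior distribution after each observation.
Introduce a new regret analysis technique that improves how cumulative regret is counted, yielding tighter bounds.
Establish a regret upper bound of $O(m\log K_{\max}\log T/\Delta_{\min})$ when the base-arm distributions are independent.
Extend the analysis to matroid bandits, where the independence assumption can be removed and the regret upper bound matches the lower bound.
Show that, even in the classical MAB problem, an approximation oracle cannot directly replace the exact offline oracle in Thompson sampling.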
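
The sample-then-optimize loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes Bernoulli base arms with Beta(1, 1) priors, and it uses a hypothetical "pick any k base arms" problem so that the exact offline oracle is a simple top-k selection. The names `cts` and `top_k_oracle` are invented for this example.

```python
import random


def top_k_oracle(theta, k):
    """Exact offline oracle for a hypothetical 'choose any k base arms' problem.

    Returns the indices of the k largest sampled means.
    """
    return sorted(range(len(theta)), key=lambda i: theta[i], reverse=True)[:k]


def cts(means, k, horizon, seed=0):
    """Sketch of combinatorial Thompson sampling with semi-bandit feedback.

    means   -- true Bernoulli means of the base arms (unknown to the learner)
    k       -- size of every super arm in this toy problem
    horizon -- number of rounds T
    Returns (cumulative expected regret, Beta 'a' counts, Beta 'b' counts).
    """
    rng = random.Random(seed)
    m = len(means)
    a = [1] * m  # Beta(1, 1) prior on every base arm (an assumption)
    b = [1] * m
    best = sum(sorted(means, reverse=True)[:k])
    regret = 0.0
    for _ in range(horizon):
        # 1. Sample one parameter per base arm from its current posterior.
        theta = [rng.betavariate(a[i], b[i]) for i in range(m)]
        # 2. Ask the exact offline oracle for the best super arm under theta.
        super_arm = top_k_oracle(theta, k)
        # 3. Semi-bandit feedback: observe each played base arm and
        #    update its Beta posterior via Bayes' rule.
        for i in super_arm:
            x = 1 if rng.random() < means[i] else 0
            a[i] += x
            b[i] += 1 - x
        regret += best - sum(means[i] for i in super_arm)
    return regret, a, b
```

Calling `cts([0.9, 0.8, 0.2, 0.1], k=2, horizon=2000)` runs the loop on a four-arm toy instance. Only the oracle encodes the combinatorial structure, so swapping in another exact solver changes the problem without touching the sampling or update steps.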

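Experimental Results

Research Questions

RQ1: In the general CMAB model, can Thompson sampling achieve a tighter regret bound than existing UCB-based policies?
RQ2: How does the regret of CTS compare with that of CUCB, C-KL-UCB, and ESCB in the general CMAB and matroid bandit settings?
RQ3: What is the theoretical regret bound of CTS in the matroid bandit setting, and does it match the lower bound?
RQ4: Why does Thompson sampling fail with an approximation oracle, even in the classical MAB problem?
RQ5: Can the proposed analysis techniques be carried over to existing UCB-based policies to tighten their regret bounds?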

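Key Findings

CTS achieves a distribution-dependent regret bound of $O(m\log K_{\max}\log T/\Delta_{\min})$, improving on the earlier $O(m(\log K_{\max})^2\log T/\Delta_{\min})$ bound.
The new regret analysis technique improves how cumulative regret is counted, yields tighter bounds, and carries over to existing UCB-based policies such as ESCB.
In the matroid bandit setting, CTS achieves a regret bound that matches the lower bound even without assuming independence across arms.
Experiments on maximum spanning tree and shortest path problems show that CTS consistently incurs lower cumulative regret than CUCB, C-KL-UCB, and ESCB.
CTS still outperforms the baselines when they are run with parameters that lack theoretical guarantees (e.g., C-KL-UCB-m), and its advantage grows as $T$ increases.
Even in the classical MAB problem, an approximation oracle cannot directly replace the exact offline oracle in Thompson sampling; the review attributes this to fundamental constraints of Bayesian inference.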
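
For the maximum spanning tree experiments mentioned above, the exact offline oracle can be a greedy algorithm, because greedy selection is optimal for maximum-weight bases of a matroid. The sketch below is one plausible way to write such an oracle (Kruskal's algorithm with union-find) for weights sampled by Thompson sampling. The function name, graph encoding, and example weights are assumptions made for illustration; this summary does not describe the paper's actual experimental code.

```python
def max_spanning_tree(num_nodes, edges, weights):
    """Greedy (Kruskal) oracle for the graphic matroid.

    num_nodes -- number of vertices, labelled 0 .. num_nodes - 1
    edges     -- list of (u, v) pairs; each edge is one base arm
    weights   -- one weight per edge, e.g. sampled posterior means
    Returns the sorted indices of the edges in a maximum-weight spanning forest.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    # Consider edges from heaviest to lightest; keep an edge only if it
    # joins two different components (i.e. it does not close a cycle).
    order = sorted(range(len(edges)), key=lambda i: weights[i], reverse=True)
    for idx in order:
        u, v = edges[idx]
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            chosen.append(idx)
    return sorted(chosen)
```

On a triangle with edges `[(0, 1), (1, 2), (0, 2)]` and weights `[0.9, 0.5, 0.7]`, the oracle keeps edges 0 and 2 and drops the lightest edge, which would close a cycle.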


This review was generated by AI and checked by human editors.