QUICK REVIEW

[Paper Review] Thompson Sampling for Budgeted Multi-armed Bandits

Yingce Xia, Haifang Li|arXiv (Cornell University)|May 1, 2015

Advanced Bandit Algorithms Research|References: 27|Cited by: 31

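One-Sentence Summary

This paper proposes a Thompson Sampling algorithm for budgeted multi-armed bandits, in which each arm pull incurs a random cost and the total cost is constrained by a budget B. In each round the algorithm draws one sample from each arm's reward posterior and one from its cost posterior, selects the arm with the highest sampled reward-to-cost ratio, and achieves a distribution-dependent regret bound of O(ln B), improving on existing methods.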

ABSTRACT

Thompson sampling is one of the earliest randomized algorithms for multi-armed bandits (MAB). In this paper, we extend the Thompson sampling to Budgeted MAB, where there is random cost for pulling an arm and the total cost is constrained by a budget. We start with the case of Bernoulli bandits, in which the random rewards (costs) of an arm are independently sampled from a Bernoulli distribution. To implement the Thompson sampling algorithm in this case, at each round, we sample two numbers from the posterior distributions of the reward and cost for each arm, obtain their ratio, select the arm with the maximum ratio, and then update the posterior distributions. We prove that the distribution-dependent regret bound of this algorithm is $O(\ln B)$, where $B$ denotes the budget. By introducing a Bernoulli trial, we further extend this algorithm to the setting that the rewards (costs) are drawn from general distributions, and prove that its regret bound remains almost the same. Our simulation results demonstrate the effectiveness of the proposed algorithm.

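Motivation and Objectives

Fill the research gap in applying Thompson Sampling to budgeted multi-armed bandits with random rewards and random costs.
Overcome the limitations of existing algorithms, which assume deterministic costs or require knowledge of the minimum cost.
Design a scalable, theoretically grounded algorithm for settings with random costs and rewards under a budget constraint.
Obtain a tighter regret bound than prior methods, particularly in the distribution-dependent setting.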

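Proposed Method

Use Beta distributions as conjugate priors to model each arm's expected reward and expected cost.
In each round, sample a reward and a cost from each arm's posterior distributions, compute their ratio, and select the arm with the largest ratio.
Update the posterior distributions of the selected arm according to the observed reward and cost.
Extend the algorithm to general reward and cost distributions by introducing a Bernoulli trial, so that the same ratio-sampling procedure still applies.
Use concentration inequalities and intermediate events to bound the expected number of pulls of suboptimal arms.
Prove an O(ln B) regret by analyzing the δ-ratio and ε-ratio gaps between the suboptimal arms and the optimal arm.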
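
The sampling, selection, and update steps listed above can be sketched for the Bernoulli case as follows. This is a minimal illustration, not the authors' implementation: the function name `budgeted_thompson_sampling`, the Beta(1, 1) priors, the simulated environment, and the stopping rule (stop once the budget is spent, or when the next cost would exceed it) are all assumptions made for this example. It also assumes at least one arm has a positive expected cost, otherwise the loop would not terminate.

```python
import numpy as np


def budgeted_thompson_sampling(reward_means, cost_means, budget, seed=0):
    """Simulate one episode on Bernoulli arms (hypothetical helper).

    Returns (total_reward, spent, pulls).
    """
    rng = np.random.default_rng(seed)
    reward_means = np.asarray(reward_means, dtype=float)
    cost_means = np.asarray(cost_means, dtype=float)
    n_arms = reward_means.size

    # Column 0 holds alpha (ones observed + 1), column 1 holds beta
    # (zeros observed + 1). Beta(1, 1) priors are an assumption here.
    reward_ab = np.ones((n_arms, 2))
    cost_ab = np.ones((n_arms, 2))
    pulls = np.zeros(n_arms, dtype=int)
    total_reward, spent = 0, 0

    while spent < budget:
        # Draw one posterior sample per arm for reward and for cost.
        theta_r = rng.beta(reward_ab[:, 0], reward_ab[:, 1])
        theta_c = rng.beta(cost_ab[:, 0], cost_ab[:, 1])
        # Pick the arm with the largest sampled reward-to-cost ratio;
        # the floor on theta_c only guards against division by zero.
        arm = int(np.argmax(theta_r / np.maximum(theta_c, 1e-12)))

        # Simulated environment: Bernoulli reward and Bernoulli cost.
        reward = int(rng.random() < reward_means[arm])
        cost = int(rng.random() < cost_means[arm])
        if spent + cost > budget:
            break  # assumed stopping rule for non-integer budgets

        # Conjugate update: outcome 1 increments alpha, 0 increments beta.
        reward_ab[arm, 1 - reward] += 1
        cost_ab[arm, 1 - cost] += 1
        pulls[arm] += 1
        total_reward += reward
        spent += cost

    return total_reward, spent, pulls


if __name__ == "__main__":
    total, spent, pulls = budgeted_thompson_sampling(
        reward_means=[0.9, 0.2], cost_means=[0.3, 0.8], budget=200
    )
    print("reward:", total, "spent:", spent, "pulls per arm:", pulls)
```

In this toy setup the first arm has the higher expected reward-to-cost ratio (0.9/0.3 versus 0.2/0.8), so the sketch should concentrate most pulls on it as the posteriors sharpen.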

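Experimental Results

Research Questions

RQ1: Can Thompson Sampling be effectively adapted to budgeted multi-armed bandits with random costs and rewards?
RQ2: What is the theoretical regret of the proposed Thompson Sampling variant in the budgeted MAB setting?
RQ3: How does the regret bound of the proposed algorithm compare with existing algorithms such as UCB-BV1/BV2 and ε-first?
RQ4: Can the algorithm be extended beyond Bernoulli distributions to general reward and cost distributions?
RQ5: Does the algorithm achieve a tighter regret constant than prior work, particularly in the distribution-dependent setting?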

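Key Findings

The proposed Thompson Sampling algorithm achieves a distribution-dependent regret bound of O(ln B), which is optimal up to logarithmic factors.
Theoretical comparison shows that the constant in the O(ln B) regret bound is strictly smaller than those of UCB-BV1 and UCB-BV2.
Simulation results show that the algorithm also performs strongly in practice.
The theoretical analysis relies on defining δ-ratio and ε-ratio gaps and on using concentration inequalities to bound the number of pulls of suboptimal arms.
When the algorithm is extended to general distributions via a Bernoulli trial, it retains the O(ln B) regret bound with minimal performance loss.
Unlike UCB-BV1/BV2, the algorithm does not require prior knowledge of the minimum expected cost, which makes it more applicable in real-world scenarios.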
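
The Bernoulli-trial extension to general distributions mentioned above can be illustrated with a small sketch. It assumes observations are bounded in [0, 1] and follows the standard trick of converting each real-valued observation into a Bernoulli outcome whose success probability equals that observation, so the Beta posterior update stays unchanged. The helper name `bernoulli_trial_update` is hypothetical, and the paper's exact procedure may differ in detail.

```python
import numpy as np


def bernoulli_trial_update(alpha, beta, observation, rng):
    """Update Beta(alpha, beta) from one observation in [0, 1].

    Hypothetical helper: runs a Bernoulli trial with success
    probability `observation` and applies the usual conjugate update.
    """
    if not 0.0 <= observation <= 1.0:
        raise ValueError("observation must lie in [0, 1]")
    success = rng.random() < observation
    return (alpha + 1, beta) if success else (alpha, beta + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha, beta = 1, 1  # Beta(1, 1) prior, an assumption of this sketch
    # Feed rewards drawn from a non-Bernoulli distribution on [0, 1].
    for observation in rng.uniform(0.2, 0.6, size=2000):
        alpha, beta = bernoulli_trial_update(alpha, beta, observation, rng)
    print("posterior mean:", alpha / (alpha + beta))
```

Because the trial succeeds with probability equal to the observation, its expectation matches the mean of the underlying distribution, so the posterior mean tracks the arm's expected reward (or cost) even when the raw observations are not binary. The same update would be applied separately to the reward and cost posteriors.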

This review was generated by AI and reviewed by human editors.