QUICK REVIEW

[Paper Review] Efficient Learning in Large-Scale Combinatorial Semi-Bandits

Zheng Wen, Branislav Kveton|arXiv (Cornell University)|Jun 28, 2014

Advanced Bandit Algorithms Research | References: 33 | Cited by: 48

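One-Sentence Summary

This paper proposes Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB), two efficient algorithms for large-scale combinatorial semi-bandits with linear generalization. By exploiting a linear model over item features, both algorithms achieve regret bounds that are independent of the number of items $L$ and sublinear in time, which enables scalable and statistically efficient learning on problems with thousands to millions of items.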

ABSTRACT

A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to combinatorial constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we consider efficient learning in large-scale combinatorial semi-bandits with linear generalization, and as a solution, propose two learning algorithms called Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB). Both algorithms are computationally efficient as long as the offline version of the combinatorial problem can be solved efficiently. We establish that CombLinTS and CombLinUCB are also provably statistically efficient under reasonable assumptions, by developing regret bounds that are independent of the problem scale (number of items) and sublinear in time. We also evaluate CombLinTS on a variety of problems with thousands of items. Our experiment results demonstrate that CombLinTS is scalable, robust to the choice of algorithm parameters, and significantly outperforms the best of our baselines.

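Research Motivation and Objectives

Address the inefficiency of existing combinatorial bandit algorithms in large-scale settings where the number of items $L$ is too large to handle.
Overcome the $\Omega(\sqrt{L})$ regret dependence inherent to standard combinatorial semi-bandit algorithms by exploiting linear structure in the item features.
Design computationally efficient algorithms that scale to practical problems with thousands to millions of items, such as online advertising and network routing.
Establish theoretical regret bounds that are independent of $L$ and sublinear in time under reasonable assumptions.
Show empirically that CombLinTS is scalable, insensitive to hyperparameters, and significantly better than existing baselines on both synthetic and real-world datasets.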

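Proposed Method

Propose Combinatorial Linear Thompson Sampling (CombLinTS), which extends Thompson sampling to combinatorial semi-bandits through linear generalization over item features.
Propose Combinatorial Linear UCB (CombLinUCB), a UCB-based alternative that uses confidence intervals on the linearly parameterized item weights.
Model item weights as a linear function of feature vectors: $\mathbb{E}[w(e)] = \phi_e^T \theta^*$, where $\phi_e$ is the feature vector of item $e$ and $\theta^* \in \mathbb{R}^d$ is an unknown coefficient vector.
Use a Bayesian linear model with a conjugate prior to maintain a posterior distribution over $\theta^*$, which makes Thompson sampling in CombLinTS efficient.
Use an efficient offline oracle to solve the combinatorial optimization problem in each round, so the algorithms remain computationally efficient whenever the offline problem can be solved efficiently.
Derive regret bounds by conditioning on a high-probability event in which the estimated weights are close to the true weights, using concentration inequalities and matrix norms.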
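The steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the prior is taken to be $N(0, \lambda^2 I)$, the observation noise is Gaussian with standard deviation $\sigma$, and the combinatorial constraint is a simple "choose any $K$ items" rule, so the offline oracle reduces to a top-$K$ selection. The names `CombLinTS`, `top_k_oracle`, `choose`, and `update` are hypothetical.

```python
import numpy as np


def top_k_oracle(weights, k):
    """Assumed offline oracle for a cardinality constraint: indices of the k largest weights."""
    return np.argsort(weights)[-k:]


class CombLinTS:
    """Sketch of Thompson sampling with a Gaussian linear model over item features."""

    def __init__(self, d, lam=1.0, sigma=1.0, seed=0):
        self.theta_bar = np.zeros(d)         # posterior mean of theta*
        self.Sigma = lam ** 2 * np.eye(d)    # posterior covariance (assumed prior N(0, lam^2 I))
        self.sigma = sigma                   # assumed observation-noise standard deviation
        self.rng = np.random.default_rng(seed)

    def choose(self, Phi, k):
        # Sample a coefficient vector from the posterior, then let the oracle
        # solve the offline problem on the sampled item weights Phi @ theta_t.
        theta_t = self.rng.multivariate_normal(self.theta_bar, self.Sigma)
        return top_k_oracle(Phi @ theta_t, k)

    def update(self, Phi, chosen, observed):
        # Semi-bandit feedback: one rank-one Bayesian (Kalman-style) update
        # per observed item weight.
        for e, w in zip(chosen, observed):
            phi = Phi[e]
            s_phi = self.Sigma @ phi
            denom = phi @ s_phi + self.sigma ** 2
            self.theta_bar = self.theta_bar + s_phi * (w - phi @ self.theta_bar) / denom
            self.Sigma = self.Sigma - np.outer(s_phi, s_phi) / denom
```

Each round costs one posterior sample, one oracle call, and one $O(d^2)$ update per observed item, so the per-round cost of the learning step depends on the feature dimension $d$ and the number of chosen items rather than on $L$. A different constraint (for example, a matching or a path) would only require swapping `top_k_oracle` for the corresponding offline solver.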

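Experimental Results

Research Questions

RQ1: Can we design combinatorial bandit algorithms whose regret is independent of the number of items $L$ in large-scale problems?
RQ2: How can linear generalization over item features be used effectively to reduce regret in combinatorial semi-bandits?
RQ3: Do Thompson sampling and UCB-based methods with linear generalization retain their theoretical regret guarantees when scaled to large item sets?
RQ4: How do these algorithms perform in practice on real-world and synthetic datasets with thousands of items?
RQ5: Can the proposed algorithms be extended to contextual combinatorial semi-bandits with minimal modification?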

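Key Findings

Under reasonable assumptions, CombLinTS and CombLinUCB achieve regret bounds that are independent of $L$ and sublinear in the time horizon $n$, specifically $O(\sqrt{dn \log n})$.
The regret bound for CombLinUCB is $R^\gamma(n) \leq \frac{2cK\lambda}{1-\gamma}\sqrt{\frac{dn\ln(1+nK\lambda^2/(d\sigma^2))}{\ln(1+\lambda^2/\sigma^2)}} + nK\delta$, where $c$ satisfies a specific condition involving $\lambda$, $\sigma$, and $\delta$.
The theoretical regret bounds still hold when $L = \infty$, which indicates robustness to infinite item spaces.
Empirical evaluation shows that CombLinTS scales to problems with thousands of items and is insensitive to the choice of hyperparameters.
CombLinTS significantly outperforms all baseline algorithms on synthetic and real-world datasets, including a real-world bipartite matching problem.
The analysis and algorithms extend naturally to contextual combinatorial semi-bandits, which broadens their applicability.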
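To make the role of the constant $c$ in the CombLinUCB bound more concrete, the sketch below computes an upper confidence bound for every item from a posterior mean and covariance, then passes those optimistic weights to an offline oracle. It assumes the usual linear-UCB index form $\phi_e^T \bar{\theta} + c\sqrt{\phi_e^T \Sigma \phi_e}$ and a top-$K$ oracle; the summary does not spell out either, so both are illustrative assumptions, and the function names are hypothetical.

```python
import numpy as np


def ucb_weights(Phi, theta_bar, Sigma, c):
    """Optimistic item weights: posterior mean plus c posterior standard deviations."""
    mean = Phi @ theta_bar
    # var[i] = Phi[i] @ Sigma @ Phi[i], computed for all items at once
    var = np.einsum("ij,jk,ik->i", Phi, Sigma, Phi)
    return mean + c * np.sqrt(np.maximum(var, 0.0))


def top_k_oracle(weights, k):
    """Assumed offline oracle for a cardinality constraint: indices of the k largest weights."""
    return np.argsort(weights)[-k:]


# Toy example: three items with one-hot features.
Phi = np.eye(3)
theta_bar = np.array([1.0, 2.0, 3.0])
Sigma = np.diag([4.0, 1.0, 0.0])
ucb = ucb_weights(Phi, theta_bar, Sigma, c=2.0)
best = top_k_oracle(ucb, 1)
```

In this toy example the first item has the lowest mean but the largest uncertainty, so its optimistic weight is the highest and the oracle selects it. A larger $c$ widens the confidence intervals and pushes the algorithm toward exploration, which is consistent with $c$ appearing as a multiplicative factor in the bound.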

This review was generated by AI and checked by a human editor.