QUICK REVIEW

[Paper Review] Combinatorial Bandits Revisited

Richard Combes, Sadegh Talebi|arXiv (Cornell University)|Feb 11, 2015

Advanced Bandit Algorithms Research | References: 32 | Cited by: 114

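One-Sentence Summary

This paper proposes two new algorithms, ESCB and CombEXP, for stochastic combinatorial bandits under semi-bandit feedback and adversarial combinatorial bandits under bandit feedback, respectively. ESCB achieves a regret bound of $\mathcal{O}(\sqrt{m}d\Delta_{\min}^{-1}\log T)$, a $\sqrt{m}$-factor improvement over previous methods, while CombEXP matches the regret scaling of state-of-the-art methods on $m$-sets, matchings, and spanning trees, with lower computational complexity for some combinatorial problems. Here $d$ is the number of basic arms, $m$ the maximum number of arms in an action, $\Delta_{\min}$ the smallest suboptimality gap, and $T$ the time horizon.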

ABSTRACT

This paper investigates stochastic and adversarial combinatorial multi-armed bandit problems. In the stochastic setting under semi-bandit feedback, we derive a problem-specific regret lower bound, and discuss its scaling with the dimension of the decision space. We propose ESCB, an algorithm that efficiently exploits the structure of the problem and provide a finite-time analysis of its regret. ESCB has better performance guarantees than existing algorithms, and significantly outperforms these algorithms in practice. In the adversarial setting under bandit feedback, we propose CombEXP, an algorithm with the same regret scaling as state-of-the-art algorithms, but with lower computational complexity for some combinatorial problems.

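Research Motivation and Objectives

Establish a problem-specific regret lower bound for stochastic combinatorial bandits under semi-bandit feedback.
Design ESCB, an efficient algorithm that exploits the problem structure to achieve a tighter regret bound than existing methods.
Propose CombEXP for adversarial combinatorial bandits under bandit feedback, matching the regret scaling of state-of-the-art methods at lower computational cost.
Analyze how the regret of both algorithms scales across several combinatorial structures, including $m$-sets, matchings, spanning trees, and cut sets.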

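Proposed Methods

Derive an asymptotic regret lower bound for stochastic combinatorial bandits using information-theoretic arguments, and establish its tightness and its dependence on the specific problem instance.
Propose ESCB, which assigns each arm a KL-UCB-style index built from a likelihood-ratio test with a vanishing error, enabling efficient exploration.
Use a sequential sampling strategy in ESCB that balances exploration and exploitation by prioritizing arms whose reward estimates are highly uncertain.
Propose CombEXP, an exponential-weights bandit algorithm with a novel projection step that projects the weights onto the convex hull of the action set, using KL divergence as the distance measure.
Apply iterative projection algorithms (e.g., Sinkhorn-style) to compute the exponential-weights update efficiently, which is particularly suited to action sets with structured support.
Bound the regret in the adversarial setting via matrix eigenvalue analysis and expected coverage probabilities, with particular attention to $\underline{\lambda}$ and $\mu_{\min}$.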
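The KL projection step described above can be made concrete in one special case. For $m$-sets, the convex hull of the action set scaled by $1/m$ is the capped simplex $\{p : \sum_i p_i = 1,\ 0 \le p_i \le 1/m\}$, and the KL projection onto it has a closed form: the largest entries are capped and the rest are rescaled proportionally. The sketch below covers only this case and is not the general iterative procedure the summary mentions; the function name `kl_project_capped_simplex` and its arguments are hypothetical.

```python
import numpy as np


def kl_project_capped_simplex(q, cap):
    """Return argmin_p KL(p || q) subject to sum(p) = 1 and 0 <= p_i <= cap.

    Illustrative special case only: for m-sets, cap = 1/m. Requires q > 0.
    """
    q = np.asarray(q, dtype=float)
    d = q.size
    if np.any(q <= 0):
        raise ValueError("q must be strictly positive")
    if cap * d < 1 - 1e-12:
        raise ValueError("cap is too small for the constraints to be feasible")

    order = np.argsort(-q)                          # largest entries first
    q_sorted = q[order]
    tail_sums = np.cumsum(q_sorted[::-1])[::-1]     # tail_sums[k] = sum(q_sorted[k:])

    p_sorted = np.empty(d)
    for k in range(d):
        # Cap the k largest entries, spread the remaining mass proportionally.
        scale = (1.0 - k * cap) / tail_sums[k]
        if scale * q_sorted[k] <= cap + 1e-12:
            p_sorted[:k] = cap
            p_sorted[k:] = scale * q_sorted[k:]
            break

    p = np.empty(d)
    p[order] = p_sorted                             # undo the sort
    return p


if __name__ == "__main__":
    # d = 4 arms, m = 2, so each coordinate is capped at 1/m = 0.5.
    print(kl_project_capped_simplex([0.7, 0.1, 0.1, 0.1], cap=0.5))
```

Multiplying the result by $m$ gives per-arm inclusion probabilities for an $m$-set. Other structures such as matchings or spanning trees have different hulls and would need a different projection routine.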

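Experimental Results

Research Questions

RQ1: Is there a fundamental limit on the regret of stochastic combinatorial bandits under semi-bandit feedback, and how does it scale with $m$ and $d$?
RQ2: Can an algorithm achieve $\mathcal{O}(\sqrt{m}d\Delta_{\min}^{-1}\log T)$ regret, improving on existing bounds such as $\mathcal{O}(m^2d\Delta_{\min}^{-1}\log T)$?
RQ3: Does CombEXP match the regret scaling of state-of-the-art algorithms on combinatorial problems with structured action sets while reducing computational complexity?
RQ4: How do the regret bounds of ESCB and CombEXP scale across combinatorial structures such as $m$-sets, matchings, spanning trees, and cut sets?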
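To make the index-policy idea behind RQ2 concrete, here is a minimal sketch for $m$-sets under semi-bandit feedback. It is not the KL-based index described in this review: it swaps in a simpler square-root confidence bonus summed over the arms of an action. The exploration rate $f(n) = \log n + 4m \log\log n$ is an assumption and should be checked against the paper. All function names are hypothetical, and the brute-force enumeration of actions is only feasible for tiny instances.

```python
import itertools
import math

import numpy as np


def all_m_sets(d, m):
    """Enumerate every m-subset of d arms as a 0/1 vector (tiny d only)."""
    actions = np.zeros((math.comb(d, m), d))
    for row, subset in enumerate(itertools.combinations(range(d), m)):
        actions[row, list(subset)] = 1.0
    return actions


def exploration_rate(n, m):
    """Assumed exploration rate f(n) = log n + 4 m log log n, clipped at 0 (n >= 2)."""
    return max(math.log(n) + 4 * m * math.log(math.log(n)), 0.0)


def escb_style_index(action, theta_hat, counts, n, m):
    """Empirical reward of the action plus a joint confidence bonus."""
    bonus = math.sqrt(0.5 * exploration_rate(n, m) * float(np.sum(action / counts)))
    return float(action @ theta_hat) + bonus


def select_action(actions, theta_hat, counts, n, m):
    """Pick the action with the largest index (ties go to the first one)."""
    scores = [escb_style_index(a, theta_hat, counts, n, m) for a in actions]
    return actions[int(np.argmax(scores))]


def update(theta_hat, counts, action, rewards):
    """Semi-bandit update: only the arms in the played action are observed."""
    played = action > 0
    counts[played] += 1
    theta_hat[played] += (rewards[played] - theta_hat[played]) / counts[played]


if __name__ == "__main__":
    actions = all_m_sets(d=4, m=2)
    theta_hat = np.full(4, 0.5)                   # running mean reward per arm
    counts = np.array([100.0, 100.0, 100.0, 1.0])  # arm 3 is barely explored
    print(select_action(actions, theta_hat, counts, n=301, m=2))
```

Because the bonus is computed once per action rather than per arm, an action that contains a rarely sampled arm receives a large index even when its empirical reward is unremarkable, which is how exploration is driven in this sketch.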

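Key Findings

The paper establishes a problem-specific regret lower bound for stochastic combinatorial bandits under semi-bandit feedback and shows that it is tight, providing a theoretical basis for algorithm design.
ESCB achieves a regret bound of $\mathcal{O}(\sqrt{m}d\Delta_{\min}^{-1}\log T)$, a $\sqrt{m}$-factor improvement over previous algorithms, and performs significantly better in numerical experiments.
CombEXP achieves a regret scaling of $\mathcal{O}(\sqrt{m^3 T (d + m^{1/2} \underline{\lambda}^{-1}) \log \mu_{\min}^{-1}})$, comparable to state-of-the-art algorithms, with lower computational complexity.
For $m$-sets, CombEXP's regret scales as $\mathcal{O}(\sqrt{m^3 d T \log(d/m)})$, matching ComBand and EXP2 with John's exploration.
For perfect matchings in $\mathcal{K}_{m,m}$, CombEXP's regret is $\mathcal{O}(\sqrt{m^5 T \log m})$, matching the known upper bound.
For spanning trees in $\mathcal{K}_N$ with $N \geq 6$, CombEXP achieves $\mathcal{O}(\sqrt{N^5 T \log N})$ regret, matching ComBand and EXP2 with John's exploration.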
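As a rough numerical check of how the general CombEXP expression specializes to $m$-sets, the sketch below evaluates both expressions on a tiny instance. Two assumptions are made and should be verified against the paper: $\underline{\lambda}$ is taken to be the smallest nonzero eigenvalue of $E[MM^\top]$ for $M$ uniform over the action set, and $\mu_{\min} = m/d$ for $m$-sets (chosen so that $\log \mu_{\min}^{-1} = \log(d/m)$). All function names are hypothetical, and constants hidden by the $\mathcal{O}(\cdot)$ notation are ignored.

```python
import itertools
import math

import numpy as np


def mset_actions(d, m):
    """All m-subsets of d arms as rows of a 0/1 matrix (tiny d only)."""
    rows = []
    for subset in itertools.combinations(range(d), m):
        row = np.zeros(d)
        row[list(subset)] = 1.0
        rows.append(row)
    return np.array(rows)


def smallest_nonzero_eigenvalue(actions, tol=1e-9):
    """Smallest nonzero eigenvalue of E[M M^T], M uniform over the actions.

    Assumed definition of lambda-underbar; check against the paper.
    """
    second_moment = actions.T @ actions / len(actions)
    eigenvalues = np.linalg.eigvalsh(second_moment)
    return float(min(v for v in eigenvalues if v > tol))


def combexp_leading_term(m, d, T, lam, mu_min):
    """sqrt(m^3 T (d + sqrt(m)/lambda) log(1/mu_min)), constants dropped."""
    return math.sqrt(m**3 * T * (d + math.sqrt(m) / lam) * math.log(1.0 / mu_min))


def mset_scaling(m, d, T):
    """sqrt(m^3 d T log(d/m)), the m-set scaling quoted above."""
    return math.sqrt(m**3 * d * T * math.log(d / m))


if __name__ == "__main__":
    d, m, T = 6, 2, 10_000
    lam = smallest_nonzero_eigenvalue(mset_actions(d, m))
    ratio = combexp_leading_term(m, d, T, lam, mu_min=m / d) / mset_scaling(m, d, T)
    print(f"lambda = {lam:.4f}, general / m-set ratio = {ratio:.3f}")
```

For $m$-sets the eigenvalue computed this way equals $m(d-m)/(d(d-1))$, so $m^{1/2}\underline{\lambda}^{-1}$ is of order $d/\sqrt{m}$ when $m \le d/2$. The general expression therefore stays within a constant factor of the $m$-set scaling under these assumptions.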

This review was generated by AI and checked by a human editor.