QUICK REVIEW

[Paper Review] Combinatorial Multi-Armed Bandit and Its Extension to Probabilistically Triggered Arms

Wei Chen, Yajun Wang|arXiv (Cornell University)|Jul 31, 2014

Advanced Bandit Algorithms Research | References: 37 | Cited by: 123

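One-Sentence Summary

This paper proposes a general combinatorial multi-armed bandit (CMAB) framework, extended to probabilistically triggered arms, that covers nonlinear reward settings such as social influence maximization and online advertising. Its CUCB algorithm achieves O(log n) distribution-dependent regret, with a tighter bound than prior work on linear-reward combinatorial bandits; the guarantee holds under a bounded smoothness assumption on the reward and given an offline (α,β)-approximation oracle.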

ABSTRACT

We define a general framework for a large class of combinatorial multi-armed bandit (CMAB) problems, where subsets of base arms with unknown distributions form super arms. In each round, a super arm is played and the base arms contained in the super arm are played and their outcomes are observed. We further consider the extension in which more base arms could be probabilistically triggered based on the outcomes of already triggered arms. The reward of the super arm depends on the outcomes of all played arms, and it only needs to satisfy two mild assumptions, which allow a large class of nonlinear reward instances. We assume the availability of an offline (α,β)-approximation oracle that takes the means of the outcome distributions of arms and outputs a super arm that with probability β generates an α fraction of the optimal expected reward. The objective of an online learning algorithm for CMAB is to minimize (α,β)-approximation regret, which is the difference between the αβ fraction of the expected reward when always playing the optimal super arm, and the expected reward of playing super arms according to the algorithm. We provide the CUCB algorithm that achieves O(log n) distribution-dependent regret, where n is the number of rounds played, and we further provide distribution-independent bounds for a large class of reward functions. Our regret analysis is tight in that it matches the bound of the UCB1 algorithm (up to a constant factor) for the classical MAB problem, and it significantly improves the regret bound in an earlier paper on combinatorial bandits with linear rewards. We apply our CMAB framework to two new applications, probabilistic maximum coverage and social influence maximization, both having nonlinear reward structures. In particular, application to social influence maximization requires our extension on probabilistically triggered arms.
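
The regret notion described in the abstract can be written compactly. The symbols here are notation assumed for illustration rather than quoted from the abstract: S_t is the super arm played in round t, r_μ(S) is the expected reward of super arm S under the true mean vector μ, and opt_μ is the optimal expected reward.

```latex
\mathrm{Reg}_{\alpha,\beta}(n)
  \;=\; n \cdot \alpha \cdot \beta \cdot \mathrm{opt}_{\mu}
  \;-\; \mathbb{E}\!\left[\sum_{t=1}^{n} r_{\mu}(S_t)\right],
\qquad
\mathrm{opt}_{\mu} \;=\; \max_{S}\, r_{\mu}(S).
```

With an exact oracle (α = β = 1) this reduces to the usual regret against always playing the best super arm.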

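Motivation and Objectives

Formally establish a general CMAB framework that handles combinatorial arms and nonlinear reward functions.
Extend CMAB to probabilistically triggered arms, where triggering one arm may randomly trigger others.
Design an online learning algorithm (CUCB) that minimizes (α,β)-approximation regret under limited feedback.
Provide tight regret bounds for the extended framework, both distribution-dependent and distribution-independent.
Apply the framework to concrete problems: probabilistic maximum coverage in online advertising and influence maximization in social networks.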
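
To make "probabilistically triggered arms" concrete, here is a minimal sketch of one round of influence spread. It assumes an independent-cascade-style diffusion in which every edge is a base arm with its own activation probability; the function name, the edge dictionary, and the diffusion model itself are illustrative assumptions, not details given in this review. The point it shows is that an edge is only observed (triggered) if its source node becomes active, which itself depends on the outcomes of earlier edges.

```python
import random


def independent_cascade(edges, seeds, rng):
    """Simulate one diffusion round (hypothetical helper).

    edges: dict mapping (u, v) -> activation probability of that edge.
    seeds: iterable of initially active nodes (the chosen super arm).
    Returns (active_nodes, observed), where observed maps each triggered
    edge to its Bernoulli outcome. Edges leaving nodes that never become
    active are not triggered and give no feedback.
    """
    out_edges = {}
    for (u, v), p in edges.items():
        out_edges.setdefault(u, []).append((v, p))

    active = set(seeds)
    frontier = list(seeds)
    observed = {}
    while frontier:
        u = frontier.pop()
        for v, p in out_edges.get(u, []):
            success = rng.random() < p  # outcome of base arm (u, v)
            observed[(u, v)] = success
            if success and v not in active:
                active.add(v)
                frontier.append(v)
    return active, observed


if __name__ == "__main__":
    demo_edges = {("a", "b"): 1.0, ("b", "c"): 0.0, ("c", "d"): 1.0}
    active, observed = independent_cascade(demo_edges, ["a"], random.Random(0))
    print(sorted(active), observed)
```

In the demo graph the edge ("c", "d") is never triggered because node "c" is never activated, so the learner receives no sample for it in that round. This is why arms with a small triggering probability are harder to learn.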

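Proposed Method

Proposes a CMAB framework in which a super arm is a subset of base arms and the reward is a possibly nonlinear function of the outcomes of all played arms, required to satisfy bounded smoothness.
Introduces probabilistically triggered arms, where some arms are activated at random depending on the outcomes of other arms, as in viral marketing.
Uses an (α,β)-approximation oracle that, given the means of the arms' outcome distributions, returns with probability β a super arm whose expected reward is at least α times the optimal expected reward.
Designs the CUCB (Combinatorial Upper Confidence Bound) algorithm, which balances exploration and exploitation through confidence intervals on the arm means.
Derives an O(log n) distribution-dependent regret bound by analyzing the confidence intervals together with the smoothness of the reward function.
Establishes distribution-independent regret bounds using the inverse of the bounded smoothness function f(x), with explicit dependence on |V|, |E|, and p_min.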
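
The loop described above (optimistic estimates fed to an offline oracle, then updated from the observed outcomes) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's pseudocode: arms are Bernoulli, the oracle is an exact top-k selector for a linear sum reward (so α = β = 1), there are no triggered arms, and the confidence radius sqrt(3 ln t / (2 T_i)) is the form recalled from the paper rather than something stated in this review. All function names are hypothetical.

```python
import math
import random


def top_k_oracle(estimates, k):
    """Stand-in offline oracle: picks the k arms with the largest estimates."""
    order = sorted(range(len(estimates)), key=lambda i: estimates[i], reverse=True)
    return order[:k]


def cucb(true_means, k, n_rounds, seed=0):
    """Run a CUCB-style loop on Bernoulli arms; returns (play counts, empirical means)."""
    rng = random.Random(seed)
    m = len(true_means)
    counts = [0] * m      # T_i: number of times arm i has been observed
    means = [0.0] * m     # empirical mean outcome of arm i
    for t in range(1, n_rounds + 1):
        # Optimistic estimate per arm; unobserved arms get +inf so they are tried first.
        ucb = [
            math.inf if counts[i] == 0
            else min(1.0, means[i] + math.sqrt(3 * math.log(t) / (2 * counts[i])))
            for i in range(m)
        ]
        super_arm = top_k_oracle(ucb, k)
        for i in super_arm:
            outcome = 1.0 if rng.random() < true_means[i] else 0.0
            counts[i] += 1
            means[i] += (outcome - means[i]) / counts[i]  # incremental mean update
    return counts, means


if __name__ == "__main__":
    counts, means = cucb([0.9, 0.8, 0.2, 0.1], k=2, n_rounds=2000)
    print(counts, [round(x, 2) for x in means])
```

Swapping `top_k_oracle` for an approximation oracle of a nonlinear reward leaves the loop unchanged, which is the modularity the framework relies on: the learner only maintains per-arm statistics, and all combinatorial structure lives in the oracle.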

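Experimental Results

Research Questions

RQ1: Can the general CMAB framework be extended to probabilistically triggered arms while keeping tight regret bounds?
RQ2: How does CUCB achieve O(log n) distribution-dependent regret when the reward function is nonlinear and bounded-smooth?
RQ3: How does the (α,β)-approximation oracle affect regret in combinatorial bandit settings that are computationally hard?
RQ4: Is the 1/p_i dependence of the regret bound necessary for probabilistically triggered arms, particularly in influence maximization?
RQ5: Can the theoretical bounds be tightened or improved for specific functions such as f(x) = γx^ω (ω < 1)?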
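
As an illustration of the kind of offline oracle RQ3 refers to, the sketch below implements a greedy oracle for a probabilistic maximum coverage instance. The setup is an assumption made for illustration: a bipartite graph where `probs[u][v]` is the probability that left node u covers target v, edges are independent, and the reward is the expected number of covered targets. Under independence that expectation is the sum over targets of 1 minus the product of miss probabilities. Greedy selection on a monotone submodular objective such as this one is a standard (1 - 1/e)-approximation, so it would play the role of an oracle with α = 1 - 1/e and β = 1. Function names are hypothetical.

```python
def expected_coverage(probs, chosen):
    """Expected number of targets covered by the chosen left nodes.

    Nonlinear in the edge probabilities: each target is covered unless
    every chosen node independently fails to cover it.
    """
    n_targets = len(probs[0])
    total = 0.0
    for v in range(n_targets):
        miss = 1.0
        for u in chosen:
            miss *= 1.0 - probs[u][v]
        total += 1.0 - miss
    return total


def greedy_oracle(probs, k):
    """Pick k left nodes one at a time, each maximizing the coverage so far."""
    chosen = []
    for _ in range(k):
        candidates = [u for u in range(len(probs)) if u not in chosen]
        best = max(candidates, key=lambda u: expected_coverage(probs, chosen + [u]))
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    probs = [[0.9, 0.0], [0.0, 0.8], [0.5, 0.5]]
    picked = greedy_oracle(probs, 2)
    print(picked, expected_coverage(probs, picked))
    print(expected_coverage(probs, [0, 1]))
```

On this small instance greedy picks nodes 2 and 0 for an expected coverage of about 1.45, while the best pair (nodes 0 and 1) reaches 1.7. That gap is exactly what the (α,β)-approximation regret accounts for: the learner is compared against an α fraction of the optimum, not the optimum itself.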

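Key Findings

CUCB achieves O(log n) distribution-dependent regret, matching the bound of UCB1 for the classical MAB problem up to a constant factor.
For social influence maximization, the distribution-dependent regret bound is O(|V|²|E|² log n / (Δ_min² p_i)) for each arm i, plus an additional O(|E|Δ_max) term.
The distribution-independent regret bound is O(|V|√(48|E|³ n log n / p*)) + O(|E|Δ_max), a polynomial dependence on problem size.
Through the bounded smoothness property, the framework supports nonlinear reward functions; for influence maximization the bounded smoothness function is f(x) = |E||V|x.
The paper corrects an earlier erroneous claim about smoothness in influence maximization and shows that the original function f(x) = |E||V|x remains valid under the corrected analysis.
The regret analysis is tight and significantly improves on earlier work on combinatorial bandits with linear rewards.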
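
The |V|²|E|² factor in the influence maximization bound can be traced back to the bounded smoothness function. The step below assumes, as a reading of this summary rather than a quoted theorem, that the per-arm term of the general bound scales as log n divided by the squared inverse smoothness function at Δ_min times p_i; the constant C is a placeholder.

```latex
f(x) = |E|\,|V|\,x
\;\Longrightarrow\;
f^{-1}(\Delta_{\min}) = \frac{\Delta_{\min}}{|E|\,|V|}
\;\Longrightarrow\;
\frac{C \log n}{\bigl(f^{-1}(\Delta_{\min})\bigr)^{2}\, p_i}
\;=\;
\frac{C\,|V|^{2}\,|E|^{2}\,\log n}{\Delta_{\min}^{2}\, p_i}.
```

A steeper smoothness function (a larger slope |E||V|) means a small error in an arm's mean can move the reward a lot, so each arm must be estimated more precisely, and that inflates the bound quadratically.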

This review was generated by AI and reviewed by human editors.