QUICK REVIEW

[Paper Review] Combinatorial Multi-Armed Bandit with General Reward Functions

Wei Chen, Wei Hu|arXiv (Cornell University)|Oct 20, 2016

Advanced Bandit Algorithms Research|References: 23|Cited by: 73

One-Sentence Summary

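This paper proposes the stochastically dominant confidence bound (SDCB) algorithm for combinatorial multi-armed bandit problems with general nonlinear reward functions, such as the max function and nonlinear utility functions, whose expected reward depends on the full distributions of the arms rather than only on their means. By estimating each distribution together with a stochastically dominant confidence bound, SDCB achieves O(log T) distribution-dependent regret and Õ(√T) distribution-independent regret. For the K-MAX problem, the paper gives the first polynomial-time approximation scheme (PTAS) for the offline problem and the first Õ(√T) bound on the (1−ε)-approximation regret of the online problem.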

ABSTRACT

In this paper, we study the stochastic combinatorial multi-armed bandit (CMAB) framework that allows a general nonlinear reward function, whose expected value may not depend only on the means of the input random variables but possibly on the entire distributions of these variables. Our framework enables a much larger class of reward functions such as the $\max()$ function and nonlinear utility functions. Existing techniques relying on accurate estimations of the means of random variables, such as the upper confidence bound (UCB) technique, do not work directly on these functions. We propose a new algorithm called stochastically dominant confidence bound (SDCB), which estimates the distributions of underlying random variables and their stochastically dominant confidence bounds. We prove that SDCB can achieve $O(\log{T})$ distribution-dependent regret and $\tilde{O}(\sqrt{T})$ distribution-independent regret, where $T$ is the time horizon. We apply our results to the $K$-MAX problem and expected utility maximization problems. In particular, for $K$-MAX, we provide the first polynomial-time approximation scheme (PTAS) for its offline problem, and give the first $\tilde{O}(\sqrt{T})$ bound on the $(1-\epsilon)$-approximation regret of its online problem, for any $\epsilon>0$.
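To make the abstract's central point concrete, the following minimal sketch compares two pairs of independent arms with identical means but different expected maximum. The arm definitions and the helper name `expected_max_bruteforce` are illustrative assumptions, not taken from the paper.

```python
from itertools import product

def expected_max_bruteforce(arms):
    """Exact E[max] for independent discrete arms.

    Each arm is a list of (value, probability) pairs (hypothetical format).
    All joint outcomes are enumerated, so this only suits tiny examples.
    """
    total = 0.0
    for combo in product(*arms):
        prob = 1.0
        for _, p in combo:
            prob *= p
        total += prob * max(v for v, _ in combo)
    return total

# Two deterministic arms that always pay 0.5 (mean 0.5 each).
safe = [[(0.5, 1.0)], [(0.5, 1.0)]]
# Two Bernoulli(0.5) arms (mean 0.5 each as well).
risky = [[(0.0, 0.5), (1.0, 0.5)], [(0.0, 0.5), (1.0, 0.5)]]

print(expected_max_bruteforce(safe))   # 0.5
print(expected_max_bruteforce(risky))  # 0.75
```

Both pairs look identical to a learner that tracks only means, yet their expected max rewards differ, which is why a mean-based index such as UCB does not apply directly.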

Motivation and Objectives

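Address the limitation of existing combinatorial multi-armed bandit (CMAB) frameworks, which rely on linear or mean-based reward functions.
Support online learning for reward functions such as max() and nonlinear utility functions, whose expected reward depends on the full distributions of the random variables.
Develop an algorithm that handles general nonlinear reward functions and does not rely on accurate estimates of the means.
Provide theoretical regret bounds for such general nonlinear reward functions in both the distribution-dependent and distribution-independent settings.
Establish the first polynomial-time approximation scheme (PTAS) for the offline K-MAX problem and the first Õ(√T) bound on the (1−ε)-approximation regret of its online variant.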

Proposed Method

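Proposes the stochastically dominant confidence bound (SDCB) algorithm, which estimates the full distributions of the underlying random variables together with their stochastically dominant confidence bounds.
Uses the distribution estimates to construct confidence bounds that stochastically dominate the true distributions, which supports robust decision-making under uncertainty.
Applies the SDCB framework to the K-MAX problem and to expected utility maximization (EUM) problems with nonlinear utility functions.
Proposes Lazy-SDCB, a variant optimized for continuous distributions that reduces computational cost by deferring full distribution estimation.
In online learning, uses submodular function feedback to handle combinatorial super arms, updating from the incremental reward feedback of each selected arm.
Proves theoretical regret bounds: under general reward functions, SDCB achieves O(log T) distribution-dependent regret and Õ(√T) distribution-independent regret.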
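The idea of a stochastically dominant confidence bound can be sketched as follows: take the empirical CDF of an arm, shift it down by a confidence radius, clip at zero, and put the removed probability mass at the top of the support. A lower CDF corresponds to a distribution that first-order stochastically dominates the empirical one. This is a minimal sketch under stated assumptions: outcomes lie in [0, 1], the grid ends at the top of the support, and the radius sqrt(3 ln t / (2 n)) is an illustrative choice rather than a constant confirmed by this review. The function names are hypothetical.

```python
import math
from bisect import bisect_right

def empirical_cdf(samples, grid):
    """Empirical CDF of `samples`, evaluated at each point of `grid`."""
    xs = sorted(samples)
    return [bisect_right(xs, g) / len(xs) for g in grid]

def dominant_cdf(samples, grid, t):
    """Optimistic CDF: empirical CDF lowered by a confidence radius.

    Assumes outcomes in [0, 1], grid[-1] is the top of the support, and
    t >= 1 is the current round. The radius formula is an assumption.
    """
    radius = math.sqrt(3.0 * math.log(t) / (2.0 * len(samples)))
    lowered = [max(f - radius, 0.0) for f in empirical_cdf(samples, grid)]
    lowered[-1] = 1.0  # the removed mass is placed at the top of the support
    return lowered

# 200 evenly spread observations in (0, 1], evaluated on a coarse grid.
samples = [i / 200 for i in range(1, 201)]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
print(empirical_cdf(samples, grid))
print(dominant_cdf(samples, grid, t=10))
```

In a full algorithm, the optimistic CDFs of all arms would be passed to an offline oracle that selects the super arm to play; that step is omitted here.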

Experimental Results

Research Questions

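RQ1: Can we design an online learning algorithm for combinatorial multi-armed bandits with general nonlinear reward functions that depend on the full distributions rather than only on the means?
RQ2: What are the best achievable regret bounds for such general reward functions in the distribution-dependent and distribution-independent settings?
RQ3: Can we obtain a polynomial-time approximation scheme (PTAS) for the offline K-MAX problem, where the goal is to select K arms that maximize the expected maximum reward?
RQ4: For any ε>0, can we achieve Õ(√T) (1−ε)-approximation regret for the online K-MAX problem?
RQ5: How can we efficiently estimate distributions and their confidence bounds to support learning under nonlinear reward functions?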
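For intuition about the offline K-MAX objective in RQ3, the sketch below computes the expected maximum of independent, nonnegative discrete arms from the product of their CDFs, then selects K arms greedily. The greedy rule is a simple stand-in heuristic, not the paper's PTAS. The data format and the names `expected_max_cdf` and `greedy_k_max` are assumptions made for illustration.

```python
def expected_max_cdf(arms):
    """E[max] of independent nonnegative discrete arms.

    Each arm is a dict {value: probability} (hypothetical format).
    Uses P(max <= v) = product of the arms' CDFs at v.
    """
    if not arms:
        return 0.0
    values = sorted({v for arm in arms for v in arm})
    total, prev_cdf = 0.0, 0.0
    for v in values:
        cdf = 1.0
        for arm in arms:
            cdf *= sum(p for x, p in arm.items() if x <= v)
        total += v * (cdf - prev_cdf)  # v times P(max == v)
        prev_cdf = cdf
    return total

def greedy_k_max(arms, k):
    """Pick k arm indices by repeatedly adding the arm that most raises E[max]."""
    chosen = []
    for _ in range(k):
        candidates = [i for i in range(len(arms)) if i not in chosen]
        best = max(
            candidates,
            key=lambda i: expected_max_cdf([arms[j] for j in chosen + [i]]),
        )
        chosen.append(best)
    return chosen

arms = [
    {0.5: 1.0},            # always 0.5
    {0.0: 0.5, 1.0: 0.5},  # Bernoulli(0.5)
    {0.0: 0.5, 1.0: 0.5},  # Bernoulli(0.5)
    {0.4: 1.0},            # always 0.4
]
picked = greedy_k_max(arms, k=2)
print(picked, expected_max_cdf([arms[i] for i in picked]))
```

The CDF-product form avoids enumerating joint outcomes, so each evaluation costs roughly the number of distinct values times the total support size of the selected arms.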

Key Findings

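Under general nonlinear reward functions, including max and nonlinear utility functions, SDCB achieves O(log T) distribution-dependent regret and Õ(√T) distribution-independent regret.
For the K-MAX problem, the paper gives the first polynomial-time approximation scheme (PTAS) for the offline problem, resolving a previously open question.
The paper establishes the first Õ(√T) bound on the (1−ε)-approximation regret of the online K-MAX problem, valid for any ε>0.
In the experiments, SDCB and Lazy-SDCB achieve significantly lower 1-approximation regret than the baseline online submodular maximization algorithm (Algorithm 8) on all tested distributions.
On continuous distributions, Lazy-SDCB is more efficient than SDCB: as shown for Distribution 4, it substantially reduces computational overhead without sacrificing regret performance.
The results indicate that learning the full distribution is essential for nonlinear reward functions, because mean-based estimates alone cannot capture the behavior of the true expected reward.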
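For reference, the α-approximation regret that appears in the findings above is commonly written as follows in the CMAB literature. The symbols are notational assumptions rather than the paper's exact notation: opt is the expected reward of the best super arm, S_t is the super arm played in round t, and r(S) is the expected reward of super arm S.

```latex
\mathrm{Reg}_{\alpha}(T) \;=\; T \cdot \alpha \cdot \mathrm{opt}
  \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r(S_t)\right]
```

Under this reading, α = 1−ε corresponds to the online K-MAX guarantee obtained with a (1−ε)-approximate offline oracle, and α = 1 gives the 1-approximation regret reported in the experiments.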


This review was generated by AI and reviewed by human editors.