QUICK REVIEW

[Paper Review] Regret Analysis of Sleeping Competing Bandits

Shinnosuke Uba, Yutaro Yamaguchi|arXiv (Cornell University)|Mar 20, 2026

Advanced Bandit Algorithms Research | Cited by 0

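One-Sentence Summary

This paper defines sleeping competing bandits for settings where availability varies over time, derives a regret lower bound, and proposes two algorithms (AC-UCB and AC-ETGS) that achieve sublinear player regret under reasonable assumptions and are asymptotically optimal when the number of arms K is large relative to the number of players N.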

ABSTRACT

The Competing Bandits framework is a recently emerging area that integrates multi-armed bandits in online learning with stable matching in game theory. While conventional models assume that all players and arms are constantly available, in real-world problems, their availability can vary arbitrarily over time. In this paper, we formulate this setting as Sleeping Competing Bandits. To analyze this problem, we naturally extend the regret definition used in existing competing bandits and derive regret bounds for the proposed model. We propose an algorithm that simultaneously achieves an asymptotic regret bound of $\mathrm{O}\left(NK\log T_{i}/\Delta^2\right)$ under reasonable assumptions, where $N$ is the number of players, $K$ is the number of arms, $T_{i}$ is the number of rounds of each player $p_i$, and $\Delta$ is the minimum reward gap. We also provide a regret lower bound of $\Omega\left(N(K-N+1)\log T_{i}/\Delta^2\right)$ under the same assumptions. This implies that our algorithm is asymptotically optimal in the regime where the number of arms $K$ is relatively larger than the number of players $N$.

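Motivation and Objectives

Formulate the sleeping competing bandits setting, in which both players and arms may be unavailable at any given time step.
Define player-optimal stable regret and player-pessimal stable regret in this dynamic two-sided market.
Establish a fundamental regret lower bound that holds for any algorithm in this setting.
Develop centralized algorithms that extend UCB and ETGS to the sleeping setting, and analyze their regret.
Characterize the regimes (for example, K relative to N) in which the proposed methods are asymptotically optimal.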

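Proposed Method

Define the Sleeping Competing Bandits model, in which the availability of both players and arms varies over time.
Use stable matching (the Gale–Shapley, or GS, algorithm) to assign arms to players under capacity constraints.
Use upper and lower confidence bounds (UCB/LCB) to guide each player's ranking of the arms.
Propose Awake Centralized UCB (AC-UCB), which learns preferences and runs player-proposing GS in every round.
Propose Awake Centralized Explore-Then-Gale–Shapley (AC-ETGS), which alternates exploration and exploitation rounds according to the ETGS criterion.
Prove sublinear regret upper bounds under certain conditions, and derive a matching lower bound that shows asymptotic optimality when K is large relative to N.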
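The matching step described above can be illustrated with a minimal sketch. It is not the paper's AC-UCB pseudocode: the function names `ucb_index` and `awake_gale_shapley` are hypothetical, the exploration bonus is the standard UCB1 form, arm preferences over players are assumed known, and every arm is assumed to have capacity 1. It only shows the general idea of ranking awake arms by an optimistic index and running player-proposing Gale–Shapley over the awake players and arms.

```python
import math


def ucb_index(mean, pulls, t):
    """UCB1-style index (an assumed form); untried pairs get +inf."""
    if pulls == 0:
        return math.inf
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def awake_gale_shapley(awake_players, awake_arms, player_scores, arm_prefs):
    """Player-proposing Gale-Shapley restricted to awake players and arms.

    player_scores[p][a]: player p's current score for arm a (e.g. a UCB index).
    arm_prefs[a]: all players, most preferred first (assumed known here).
    Each arm holds at most one player in this sketch (capacity 1).
    Returns a dict player -> arm; unmatched players are omitted.
    """
    awake_arm_set = set(awake_arms)
    # Each awake player ranks only the awake arms, highest score first.
    ranking = {
        p: sorted(awake_arm_set, key=lambda a: (-player_scores[p][a], a))
        for p in awake_players
    }
    rank_of = {
        a: {p: i for i, p in enumerate(arm_prefs[a])} for a in awake_arm_set
    }
    next_choice = {p: 0 for p in awake_players}
    holder = {}  # arm -> player currently held by that arm
    free = list(awake_players)
    while free:
        p = free.pop()
        if next_choice[p] >= len(ranking[p]):
            continue  # p has proposed to every awake arm and stays unmatched
        a = ranking[p][next_choice[p]]
        next_choice[p] += 1
        current = holder.get(a)
        if current is None:
            holder[a] = p
        elif rank_of[a][p] < rank_of[a][current]:
            holder[a] = p          # arm a prefers the new proposer
            free.append(current)   # the displaced player proposes again
        else:
            free.append(p)         # rejected; p tries its next arm
    return {p: a for a, p in holder.items()}


# Toy instance: every player scores arm 0 > 1 > 2; every arm prefers player 0 > 1 > 2.
scores = {p: {0: 0.9, 1: 0.5, 2: 0.1} for p in range(3)}
prefs = {a: [0, 1, 2] for a in range(3)}
all_awake = awake_gale_shapley([0, 1, 2], [0, 1, 2], scores, prefs)
some_asleep = awake_gale_shapley([1, 2], [0, 2], scores, prefs)
```

In the toy instance, putting player 0 and arm 1 to sleep changes the stable matching: player 1 moves up to arm 0, which is the kind of round-to-round shift in the benchmark matching that the sleeping setting has to account for.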

Figure 1: Regret comparison between random and weighted exploration with heterogeneous player unavailability probabilities.

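Experimental Results

Research Questions

RQ1: What are the fundamental regret limits of sleeping competing bandits when the availability of players and arms varies arbitrarily over time?
RQ2: Can centralized algorithms achieve sublinear player-optimal and player-pessimal stable regret in the sleeping setting?
RQ3: How do existing regret bounds generalize to the sleeping version of competing bandits, and how do they scale with N, K, T, and Δ?
RQ4: Under what conditions (for example, K relative to N) are the proposed algorithms asymptotically optimal?
RQ5: How do arm capacities and dynamic preferences affect stability and regret?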

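Key Findings

Without additional assumptions, no policy can achieve strictly sublinear regret ($\alpha$-consistency fails in the absence of such assumptions).
Under reasonable assumptions, the player-pessimal stable regret is lower-bounded by $\Omega(N(K-N+1)\log T_i/\Delta^2)$.
AC-UCB achieves a player-pessimal stable regret upper bound of $\mathrm{O}(NK\log T_i/\Delta^2)$.
AC-ETGS achieves a player-optimal stable regret upper bound of $\mathrm{O}(NK^2\log T_i/\Delta^2)$.
These bounds imply that the method is asymptotically optimal in the regime where K is large relative to N (with $K = \mathrm{O}(\log T_i)$ in the analysis).
Under the same assumptions, the fundamental lower bound is $\Omega(N(K-N+1)\log T_i/\Delta^2)$.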
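To see why the regime "K large relative to N" matters, it can help to divide the stated upper-bound expressions by the stated lower-bound expression. The short calculation below is only a comparison of those expressions, ignoring the constants hidden by $\mathrm{O}$ and $\Omega$; it is not taken from the paper's proofs.

```latex
% AC-UCB upper bound over the lower bound:
\frac{NK\log T_i/\Delta^2}{N(K-N+1)\log T_i/\Delta^2}
  = \frac{K}{K-N+1}
  = 1 + \frac{N-1}{K-N+1}
  \;\longrightarrow\; 1 \quad \text{as } K \to \infty \text{ with } N \text{ fixed}.

% AC-ETGS upper bound over the same lower bound:
\frac{NK^2\log T_i/\Delta^2}{N(K-N+1)\log T_i/\Delta^2}
  = \frac{K^2}{K-N+1},
  \quad \text{which grows roughly like } K.
```

For example, with N = 5 and K = 100 the first ratio is 100/96, about 1.04, so the AC-UCB expression is within a few percent of the lower-bound expression, while the AC-ETGS expression retains an extra factor of order K.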

Figure 2: Regret comparison between random and weighted exploration with identical player unavailability probabilities.


This review was generated by AI and reviewed by human editors.