QUICK REVIEW

[Paper Digest] Truly Adapting to Adversarial Constraints in Constrained MABs

Francesco Emanuele Stradi, Kalana Kalupahana|arXiv (Cornell University)|Feb 16, 2026

Advanced Bandit Algorithms Research | Cited by 0

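One-Sentence Summary

This paper studies the constrained multi-armed bandit (MAB) problem with unknown, potentially adversarial constraints and non-stationary losses. The proposed algorithms achieve regret and positive constraint violation bounds that degrade smoothly with the non-stationarity of the constraints, under both full feedback and bandit feedback.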

ABSTRACT

We study the constrained variant of the \emph{multi-armed bandit} (MAB) problem, in which the learner aims not only at minimizing the total loss incurred during the learning dynamic, but also at controlling the violation of multiple \emph{unknown} constraints, under both \emph{full} and \emph{bandit feedback}. We consider a non-stationary environment that subsumes both stochastic and adversarial models and where, at each round, both losses and constraints are drawn from distributions that may change arbitrarily over time. In such a setting, it is provably not possible to guarantee both sublinear regret and sublinear violation. Accordingly, prior work has mainly focused either on settings with stochastic constraints or on relaxing the benchmark with fully adversarial constraints (\emph{e.g.}, via competitive ratios with respect to the optimum). We provide the first algorithms that achieve optimal rates of regret and \emph{positive} constraint violation when the constraints are stochastic while the losses may vary arbitrarily, and that simultaneously yield guarantees that degrade smoothly with the degree of adversariality of the constraints. Specifically, under \emph{full feedback} we propose an algorithm attaining $\widetilde{\mathcal{O}}(\sqrt{T}+C)$ regret and $\widetilde{\mathcal{O}}(\sqrt{T}+C)$ {positive} violation, where $C$ quantifies the amount of non-stationarity in the constraints. We then show how to extend these guarantees when only bandit feedback is available for the losses. Finally, when \emph{bandit feedback} is available for the constraints, we design an algorithm achieving $\widetilde{\mathcal{O}}(\sqrt{T}+C)$ {positive} violation and $\widetilde{\mathcal{O}}(\sqrt{T}+C\sqrt{T})$ regret.

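Motivation and Objectives

Understand how unknown, time-varying constraint distributions affect constrained MABs.
Develop algorithms that achieve sublinear regret and sublinear positive constraint violation when the losses may be adversarial but the constraints are stochastic.
Extend the results to bandit feedback on both losses and constraints.
Characterize how the constraint non-stationarity level C degrades the violation and regret bounds.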

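Proposed Method

Introduce a corruption level C that quantifies the non-stationarity of the constraints.
Build a per-round approximate feasible set X_t from optimistic estimates of the constraint violation.
Use online mirror descent with fixed-share updates to handle the moving decision space and obtain switching-regret guarantees.
Propose a two-phase approach for bandit feedback on the losses that ensures sufficient exploration (ExpOpt-ConOMD).
Extend from bandit feedback on the losses to bandit feedback on the constraints by adjusting the confidence bounds and the exploration strategy (Constrained OMD variants).
Give theoretical bounds showing $R_T = \widetilde{\mathcal{O}}(\sqrt{T}+C)$ and $V_T = \widetilde{\mathcal{O}}(\sqrt{T}+C)$ under full feedback, with similar or slightly weaker guarantees under bandit feedback.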
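The two building blocks named above, optimistic constraint estimates and a fixed-share mirror-descent step, can be illustrated with a minimal sketch. This is not the paper's algorithm. It assumes an entropic regularizer (so the mirror-descent step reduces to a Hedge-style multiplicative update), constraint costs in [-1, 1] with feasibility meaning expected cost at most 0, and a Hoeffding-style confidence radius. The function names, the parameters `eta`, `alpha`, `delta`, and the radius formula are all hypothetical choices made for illustration.

```python
import math


def fixed_share_hedge_step(weights, losses, eta, alpha):
    """One entropic mirror-descent (Hedge) step followed by fixed-share mixing.

    Assumes full feedback: `losses` holds the loss of every arm this round.
    """
    # Multiplicative-weights update, i.e. mirror descent with negative entropy.
    updated = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(updated)
    updated = [u / total for u in updated]
    # Fixed share: mix with the uniform distribution so every arm keeps
    # at least alpha / k probability and can recover after a change.
    k = len(updated)
    return [(1.0 - alpha) * u + alpha / k for u in updated]


def optimistic_constraint(mean_cost, count, t, delta=0.05):
    """Lower confidence bound on an arm's constraint cost.

    Hypothetical Hoeffding-style radius; costs are assumed to lie in [-1, 1].
    """
    if count == 0:
        return -1.0  # unseen arms are optimistically treated as feasible
    radius = math.sqrt(2.0 * math.log(2.0 * t / delta) / count)
    return max(-1.0, mean_cost - radius)


def in_feasible_set(x, lcb_costs):
    """Check whether the mixed strategy x has optimistic expected cost <= 0."""
    return sum(xi * ci for xi, ci in zip(x, lcb_costs)) <= 0.0


# Usage: start uniform over four arms and apply one update.
w = fixed_share_hedge_step([0.25] * 4, [1.0, 0.0, 0.5, 0.5], eta=0.5, alpha=0.1)
```

A full method would additionally restrict the update to the approximate feasible set X_t, for example by projecting onto it. That step is omitted here because it requires a convex-optimization solver.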

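Experimental Results

Research Questions

RQ1: When the constraints are unknown and non-stationary and the losses may be adversarial, can sublinear regret and sublinear positive constraint violation both be achieved?
RQ2: How should the learner adaptively build the set of feasible actions to cope with unknown constraint corruption while keeping regret under control?
RQ3: What are the optimal regret and violation bounds for losses and constraints under full feedback and under bandit feedback?
RQ4: How much do the bounds degrade with the constraint non-stationarity level C, and can the degradation be made smooth?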

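Key Findings

Under full feedback, the proposed algorithm ConOMD-FS achieves regret and positive constraint violation of order $\widetilde{\mathcal{O}}(\sqrt{T}+C)$.
With bandit feedback on the losses only, the ConOMD-FS approach extends to similar guarantees, with corresponding adjustments to the analysis.
With bandit feedback on the constraints, the ExpOpt-ConOMD family achieves $\widetilde{\mathcal{O}}(\sqrt{T}+C)$ positive violation and $\widetilde{\mathcal{O}}(\sqrt{T}+C\sqrt{T})$ regret (this bound is obtained by choosing $\beta = 1/2$).
The corruption level C is shown to be the main driver of performance degradation: the guarantees degrade smoothly as C grows rather than collapsing catastrophically.
The proposed method handles time-varying constraints by achieving switching regret over a moving decision space and by applying a doubling trick over phases.
Compared with prior work, the results attain best-of-both-worlds guarantees in the stochastic-constraint setting and yield sublinear regret when the constraints are mildly adversarial.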
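To make the smooth degradation concrete, the stated bounds can be evaluated under a hypothetical parametrization $C = T^{\alpha}$ with $\alpha \in [0,1]$. The parametrization is an assumption introduced here for illustration and does not come from the paper.

```latex
\[
\widetilde{\mathcal{O}}\big(\sqrt{T} + C\big)
  = \widetilde{\mathcal{O}}\big(T^{\max\{1/2,\,\alpha\}}\big),
\qquad
\widetilde{\mathcal{O}}\big(\sqrt{T} + C\sqrt{T}\big)
  = \widetilde{\mathcal{O}}\big(T^{\alpha + 1/2}\big),
\qquad C = T^{\alpha},\ \alpha \in [0,1].
\]
```

The first bound keeps the $\sqrt{T}$ rate for $\alpha \le 1/2$ and remains sublinear for every $\alpha < 1$. The second, which is the regret bound under bandit feedback on the constraints, equals $\sqrt{T}$ when $C$ is constant ($\alpha = 0$) and remains sublinear only for $\alpha < 1/2$.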


This digest was generated by AI and reviewed by human editors.