QUICK REVIEW

[Paper Review] Conservative Bandits

Yifan Wu, Roshan Shariff|arXiv (Cornell University)|Feb 13, 2016

Advanced Bandit Algorithms Research | References: 17 | Cited by: 31

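One-Sentence Summary

This paper introduces conservative bandits, a multi-armed bandit framework in which the learner must keep its reward above a fixed baseline uniformly over time rather than only at a final time point. It proposes algorithms for both the stochastic and the adversarial settings and proves high-probability and expectation bounds on the regret; the stochastic algorithm is shown to be almost optimal, while the price of the constraint appears higher in the adversarial setting, at least for the algorithm considered.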

ABSTRACT

We study a novel multi-armed bandit problem that models the challenge faced by a company wishing to explore new strategies to maximize revenue whilst simultaneously maintaining their revenue above a fixed baseline, uniformly over time. While previous work addressed the problem under the weaker requirement of maintaining the revenue constraint only at a given fixed time in the future, the algorithms previously proposed are unsuitable due to their design under the more stringent constraints. We consider both the stochastic and the adversarial settings, where we propose natural, yet novel strategies and analyze the price for maintaining the constraints. Amongst other things, we prove both high probability and expectation bounds on the regret, while we also consider both the problem of maintaining the constraints with high probability or expectation. For the adversarial setting the price of maintaining the constraint appears to be higher, at least for the algorithm considered. A lower bound is given showing that the algorithm for the stochastic setting is almost optimal. Empirical results obtained in synthetic environments complement our theoretical findings.

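Research Motivation and Objectives

Address the challenge of keeping reward above a minimum baseline uniformly throughout a sequential decision process, going beyond earlier work that imposed the constraint only at a fixed future time.
Design algorithms for stochastic and adversarial reward settings that keep performance above a fixed baseline at all times while maximizing long-run reward.
Analyze the regret trade-off induced by enforcing the conservative constraint, distinguishing constraint satisfaction with high probability from satisfaction in expectation.
Establish theoretical guarantees, including high-probability regret bounds and a regret lower bound for the stochastic setting, showing that the proposed algorithm is almost optimal.
Validate the theoretical findings with empirical results in synthetic environments, demonstrating the practical viability of the proposed conservative bandit strategies.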
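
The difference between the uniform-in-time requirement and the weaker fixed-time requirement can be written out explicitly. The notation below is an assumption introduced here for illustration and is not taken from this review: $\mu_0$ is the expected reward of the baseline action, $\mu_{I_s}$ is the expected reward of the action chosen in round $s$, $n$ is the horizon, and $\alpha \in (0,1)$ is a hypothetical tolerance for how far below the baseline the learner may fall. The review only says "fixed baseline", so the $(1-\alpha)$ form is one plausible way to state it.

```latex
% Uniform-in-time (conservative) constraint: must hold at every round t.
\[
  \sum_{s=1}^{t} \mu_{I_s} \;\ge\; (1-\alpha)\, t\, \mu_0
  \qquad \text{for all } t \in \{1, \dots, n\}
\]

% Weaker fixed-time constraint: only required at the horizon n.
\[
  \sum_{s=1}^{n} \mu_{I_s} \;\ge\; (1-\alpha)\, n\, \mu_0
\]
```

Under this reading, an algorithm that explores heavily early on and recovers later can satisfy the second inequality while violating the first, which is why methods designed for the fixed-time requirement do not carry over directly.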

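Proposed Method

Introduces a conservative bandit framework that requires the expected reward of the chosen actions to stay above a fixed baseline at every time step, not only at the final time point.
Designs a stochastic-setting algorithm built on UCB-style confidence intervals that incorporates the conservative constraint through a thresholded exploration strategy.
For the adversarial setting, introduces a variant of FTRL (Follow-the-Regularized-Leader) that maintains the baseline constraint through constrained optimization.
Uses concentration inequalities and self-normalized martingale techniques to derive high-probability regret bounds under both constraint-satisfaction regimes.
Proposes a regret decomposition that separates the cost of conservatism from the standard bandit regret component.
Provides a regret lower bound for the stochastic setting, showing that the proposed algorithm almost achieves the optimal trade-off between exploration and constraint enforcement.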
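
The thresholded, UCB-style idea for the stochastic setting can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: the baseline (arm 0) is assumed to have a known mean, rewards are assumed Bernoulli, the confidence width is a generic Hoeffding-style bound, and the constraint is assumed to take the form "cumulative expected reward is at least `(1 - alpha) * t * baseline_mean` at every round `t`". The names `conservative_ucb`, `satisfies_constraint`, `alpha`, and `delta` are hypothetical.

```python
import math
import random


def conservative_ucb(means, baseline_mean, alpha, horizon, delta=0.05, seed=0):
    """Play `horizon` rounds; return the list of arms chosen (0 = baseline)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * (k + 1)   # pulls per arm; index 0 is the baseline
    sums = [0.0] * (k + 1)   # reward totals for the learned arms 1..k
    # Hoeffding-style constant with a crude union bound over arms and rounds
    # (an assumption made for this sketch).
    log_term = math.log(2 * k * horizon / delta)
    played = []

    for t in range(1, horizon + 1):
        # The baseline mean is known, so its bounds coincide.
        ucb = [baseline_mean] + [math.inf] * k
        lcb = [baseline_mean] + [0.0] * k
        for i in range(1, k + 1):
            if counts[i] > 0:
                mean = sums[i] / counts[i]
                width = math.sqrt(log_term / (2 * counts[i]))
                ucb[i] = mean + width
                lcb[i] = max(0.0, mean - width)

        # Optimistic candidate, as in plain UCB.
        candidate = max(range(k + 1), key=lambda i: ucb[i])

        # Pessimistic estimate of the cumulative expected reward if the
        # candidate is played now; explore only if it clears the threshold.
        safe_total = sum(counts[i] * lcb[i] for i in range(k + 1)) + lcb[candidate]
        arm = candidate if safe_total >= (1 - alpha) * t * baseline_mean else 0

        if arm > 0:
            # Bernoulli reward drawn from the hidden mean of the chosen arm.
            sums[arm] += 1.0 if rng.random() < means[arm - 1] else 0.0
        counts[arm] += 1
        played.append(arm)
    return played


def satisfies_constraint(played, means, baseline_mean, alpha):
    """Check the assumed uniform-in-time constraint using the true means."""
    true_means = [baseline_mean] + list(means)
    total = 0.0
    for t, arm in enumerate(played, start=1):
        total += true_means[arm]
        if total < (1 - alpha) * t * baseline_mean - 1e-9:
            return False
    return True
```

Calling `conservative_ucb([0.3, 0.7], baseline_mean=0.5, alpha=0.1, horizon=2000)` and passing the result to `satisfies_constraint` shows the intended behavior: the learner falls back to the baseline until enough "budget" has accumulated, then explores. With `alpha=0` no budget ever accumulates, so this sketch never leaves the baseline.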

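Experimental Results

Research Questions

RQ1: What is the fundamental regret cost of maintaining a minimum expected-reward constraint at every time step rather than only at the final time point?
RQ2: How do conservative bandit algorithms perform in stochastic and adversarial reward settings, and how does regret vary with the tightness of the constraint?
RQ3: Can the constraint be satisfied with high probability without significantly increasing regret relative to standard bandit algorithms?
RQ4: What is the theoretical lower bound on regret for conservative bandits in the stochastic setting, and how close can an algorithm get to it?
RQ5: In terms of regret growth, how does the price of conservatism compare between the stochastic and adversarial settings?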

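Key Findings

The proposed algorithm for the stochastic setting achieves almost optimal regret, matching the lower bound up to logarithmic factors.
In the adversarial setting, the price of maintaining the conservative constraint appears higher than in the stochastic case, at least for the algorithm considered.
High-probability regret bounds are established for both the stochastic and the adversarial settings, showing that the constraint can be maintained with high confidence.
The paper shows that the conservative constraint introduces a nontrivial regret cost and quantifies this penalty through a new regret decomposition.
Empirical evaluation in synthetic environments supports the theoretical findings: the conservative bandit algorithms achieve competitive regret while maintaining the baseline constraint.
The study indicates that satisfying the constraint in expectation typically costs less regret than satisfying it with high probability, highlighting a trade-off in the design choice.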

This review was generated by AI and reviewed by a human editor.