QUICK REVIEW

[Paper Review] Online Learning with Switching Costs and Other Adaptive Adversaries

Nicolò Cesa-Bianchi, Ofer Dekel, Ohad Shamir | arXiv (Cornell University) | Feb 18, 2013

Advanced Bandit Algorithms Research | References: 21 | Cited by: 53

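One-Sentence Summary

This paper studies online learning against adaptive adversaries, which react to the player's past actions, and measures performance with a notion of regret called policy regret, which better captures that adaptiveness. With switching costs, the attainable regret under bandit feedback is $\widetilde{\Theta}(T^{2/3})$, significantly worse than the $\Theta(\sqrt{T})$ rate attainable with full information. A bounded-memory adversary can force the same $T^{2/3}$ rate even with full information, so switching costs are easier to control than bounded-memory adversaries.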

ABSTRACT

We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player's performance using a new notion of regret, also known as policy regret, which better captures the adversary's adaptiveness to the player's behavior. In a setting where losses are allowed to drift, we characterize, in a nearly complete manner, the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs, the attainable rate with bandit feedback is $\widetilde{\Theta}(T^{2/3})$. Interestingly, this rate is significantly worse than the $\Theta(\sqrt{T})$ rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also show that a bounded memory adversary can force $\widetilde{\Theta}(T^{2/3})$ regret even in the full information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies.

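Motivation and Objectives

Analyze the power of adaptive adversaries in prediction with expert advice, in particular adversaries that react to the player's past actions.
Formalize policy regret as the performance measure, so that the adversary's adaptiveness is accounted for when evaluating the player.
Characterize the attainable regret rates against adversaries with switching costs and with bounded memory, under both full-information and bandit feedback.
Show that switching costs are less harmful than bounded-memory adversaries, by proving that a bounded-memory adversary can force $T^{2/3}$ regret even in the full-information setting.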

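Proposed Method

Define policy regret as the player's cumulative loss minus the cumulative loss of the best fixed action when that action is played in every round: $\sum_{t=1}^{T} f_t(x_1,\dots,x_t) - \min_{x} \sum_{t=1}^{T} f_t(x,\dots,x)$, where $f_t$ is the adversary's loss function at round $t$.
Model adaptive adversaries with bounded memory and with switching costs through loss functions that depend on the player's history of actions.
Give a novel reduction from the experts problem to the bandit problem, which shows that a bounded-memory adversary can force $\widetilde{\Theta}(T^{2/3})$ regret even in the full-information setting.
Use a two-stage strategy: estimate the losses by exploring at well-separated time points, then run Hedge on the resulting "blind" loss estimates to control the regret.
Arrange the time points on a cycle so that every exploration step has a uniform marginal distribution and boundary effects are avoided.
Split time into phases, apply the known Hedge regret bound to the estimated losses, and optimize over the number of phases $J$.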
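To make the definition of policy regret concrete, here is a minimal sketch for one simple adaptive adversary: an oblivious base loss plus a fixed charge whenever the player changes action. The function names, the `switch_cost` parameter, and the toy loss table are assumptions made for illustration and do not come from the paper. The sketch contrasts policy regret, whose comparator plays one action in every round and therefore never switches, with standard regret, whose comparator is evaluated against the player's realized history.

```python
from typing import Sequence


def player_loss(base_losses: Sequence[Sequence[float]],
                actions: Sequence[int],
                switch_cost: float = 1.0) -> float:
    """Loss actually paid: base loss of each action plus a charge per switch."""
    total = 0.0
    for t, a in enumerate(actions):
        total += base_losses[t][a]
        if t > 0 and a != actions[t - 1]:
            total += switch_cost
    return total


def policy_regret(base_losses, actions, switch_cost=1.0) -> float:
    """Comparator plays the same action in every round, so it never switches."""
    k = len(base_losses[0])
    best_fixed = min(sum(row[a] for row in base_losses) for a in range(k))
    return player_loss(base_losses, actions, switch_cost) - best_fixed


def standard_regret(base_losses, actions, switch_cost=1.0) -> float:
    """Comparator action is judged against the player's realized past actions."""
    k = len(base_losses[0])

    def comparator(a: int) -> float:
        total = 0.0
        for t, row in enumerate(base_losses):
            total += row[a]
            # The history is the player's, so the comparator is charged
            # whenever it differs from the player's previous action.
            if t > 0 and a != actions[t - 1]:
                total += switch_cost
        return total

    best = min(comparator(a) for a in range(k))
    return player_loss(base_losses, actions, switch_cost) - best


# Toy instance (hypothetical): two actions whose base losses alternate.
losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
actions = [0, 1, 0, 1]  # always chases the zero base loss, switching each round
print(policy_regret(losses, actions), standard_regret(losses, actions))
# prints: 1.0 0.0
```

On this toy input the player pays no base loss but three switches, so policy regret is positive while standard regret is zero. This gap is the kind of effect that standard regret fails to expose against an adaptive adversary.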

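Results

Research Questions

RQ1: What is the optimal attainable regret rate in online learning when the adversary adapts to the player's past actions, in particular when switching costs are present?
RQ2: How do the regret rates under full-information and bandit feedback differ against an adaptive adversary with switching costs?
RQ3: Can a bounded-memory adversary force a higher regret rate in the full-information setting than switching costs do?
RQ4: Is there a fundamental difference between controlling switching costs and controlling bounded-memory adversaries in online learning?
RQ5: Which new techniques are needed to obtain tight regret bounds against adaptive adversaries under limited feedback?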

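Key Findings

With switching costs and bandit feedback, the optimal regret rate is $\widetilde{\Theta}(T^{2/3})$, significantly worse than the $\Theta(\sqrt{T})$ rate attainable with switching costs in the full-information case.
Even in the full-information setting, a bounded-memory adversary can force the same $\widetilde{\Theta}(T^{2/3})$ regret, so bounded-memory adversaries are harder to control than switching costs.
A new stochastic adversary strategy generates loss processes with strong dependencies, and the lower bounds rely on it.
A reduction from the experts problem to the bandit problem carries the $T^{2/3}$ bandit lower bound over to the full-information setting against a bounded-memory adversary.
Arranging the time points on a cycle makes the exploration distribution uniform and removes boundary effects, which makes the loss estimates usable.
Choosing the number of phases $J \sim T^{2/3}$ gives an upper bound of order $T^{2/3}$ up to logarithmic factors, which matches the lower bound and confirms that the rate is tight.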
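The choice $J \sim T^{2/3}$ can be illustrated with a standard block-based trade-off. This is a hedged sketch and not necessarily the paper's exact analysis. Assume, for illustration only, that time is split into $J$ blocks of length $T/J$, that the player switches at most once per block (total switching cost at most $J$), and that a Hedge-type bound of order $\sqrt{J k \log k}$ over the $J$ blocks is scaled by the block length, for a total of roughly $J + T\sqrt{k \log k / J}$ with $k \ge 2$ actions. The constants and the function names below are assumptions.

```python
import math


def tradeoff_bound(T: int, J: float, k: int) -> float:
    """Assumed bound: J switches plus block-scaled regret T * sqrt(k log k / J)."""
    c = math.sqrt(k * math.log(k))
    return J + c * T / math.sqrt(J)


def best_num_blocks(T: int, k: int) -> float:
    """Closed-form minimizer of tradeoff_bound in J (set the derivative to zero)."""
    c = math.sqrt(k * math.log(k))
    return (c * T / 2.0) ** (2.0 / 3.0)


def grid_argmin(T: int, k: int) -> int:
    """Brute-force check of the minimizer over integer block counts 1..T."""
    return min(range(1, T + 1), key=lambda J: tradeoff_bound(T, J, k))


T, k = 10_000, 2
j_star = best_num_blocks(T, k)
print(round(j_star), grid_argmin(T, k))
# Growing T by a factor of 8 grows the optimal J by 8 ** (2 / 3) = 4.
print(best_num_blocks(8 * T, k) / j_star)
```

Under these assumptions the bound at the optimum equals $3J^\star$, so both the number of blocks and the resulting bound scale as $T^{2/3}$ for fixed $k$.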

This review was generated by AI and reviewed by a human editor.