QUICK REVIEW

[Paper Review] Hedging the Drift: Learning to Optimize under Non-Stationarity

Wang Chi Cheung, David Simchi‐Levi|arXiv (Cornell University)|Mar 4, 2019

Advanced Bandit Algorithms Research | References: 54 | Cited by: 35

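One-Sentence Summary

Proposes data-driven algorithms for non-stationary bandit problems that achieve state-of-the-art dynamic regret bounds, including the sliding window-UCB (SW-UCB) algorithm and the Bandit-over-Bandit (BOB) framework, extends them to several bandit models, and validates them empirically.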

ABSTRACT

We introduce data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for non-stationary bandit settings. These settings capture applications such as advertisement allocation, dynamic pricing, and traffic network routing in changing environments. We show how the difficulty posed by the (unknown a priori and possibly adversarial) non-stationarity can be overcome by an unconventional marriage between stochastic and adversarial bandit learning algorithms. Our main contribution is a general algorithmic recipe for a wide variety of non-stationary bandit problems. Specifically, we design and analyze the sliding window-upper confidence bound algorithm that achieves the optimal dynamic regret bound for each of the settings when we know the respective underlying variation budget, which quantifies the total amount of temporal variation of the latent environments. Boosted by the novel bandit-over-bandit framework that adapts to the latent changes, we can further enjoy the (nearly) optimal dynamic regret bounds in a (surprisingly) parameter-free manner. In addition to the classical exploration-exploitation trade-off, our algorithms leverage the power of the "forgetting principle" in the learning processes, which is vital in changing environments. Our extensive numerical experiments on both synthetic and real world online auto-loan datasets show that our proposed algorithms achieve superior empirical performance compared to existing algorithms.

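Research Motivation and Objectives

Address bandit learning problems in which the reward distribution drifts over time, producing non-stationarity.
Develop algorithms that adaptively hedge against change while balancing exploration and exploitation.
Quantify dynamic regret and establish (near-)optimal bounds under both known and unknown variation budgets.
Extend the framework from drifting linear bandits to related bandit settings (MAB, GLM, combinatorial semi-bandits).
Demonstrate empirical gains over existing methods on synthetic and real-world datasets.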
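
The two quantities named in these objectives can be written down for the drifting linear bandit case. The block below is one common formalization, given as a sketch: the symbols \theta_t (the unknown reward parameter at round t), D_t (the action set), and X_t (the action chosen by the learner) are notation assumed here for illustration, and only B_T appears in the summary itself.

```latex
% Dynamic regret: compare against the best action in every round,
% not against a single fixed action in hindsight.
\mathcal{R}_T
  = \sum_{t=1}^{T} \Bigl( \max_{x \in D_t} \langle x, \theta_t \rangle
      - \langle X_t, \theta_t \rangle \Bigr)

% Variation budget: B_T bounds the total movement of the parameter over the horizon.
\sum_{t=1}^{T-1} \lVert \theta_{t+1} - \theta_t \rVert_2 \;\le\; B_T
```

Under this reading, a small B_T means the environment is nearly stationary, while a B_T that grows linearly in T leaves little room for any learner to keep up.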

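Proposed Method

Introduce the sliding-window regularized least squares estimator (SW-RLSE) so that parameter estimates adapt to recent data.
Propose Sliding Window-UCB (SW-UCB), which applies optimism in the face of uncertainty with a data-dependent confidence radius.
Derive dynamic regret bounds that show the dependence on the window size w and the variation budget B_T; these are optimal (up to logarithmic factors) when B_T is known.
Develop Bandit-over-Bandit (BOB), a meta-learning framework that adaptively tunes the SW-UCB window size when B_T is unknown.
Extend the approach to several bandit variants (MAB, generalized linear bandits, combinatorial semi-bandits) and discuss the forgetting principle in non-stationary settings.
Give a theoretical lower bound on dynamic regret for drifting linear bandits, together with a matching upper bound (up to logarithmic factors).
Evaluate the algorithms on synthetic data and an online auto-loan dataset to demonstrate the empirical gains.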
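
The first two items above (SW-RLSE plus an optimistic action choice) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the class name `SlidingWindowLinUCB`, the helper `run_demo`, the regularizer `lam`, and the toy two-arm environment are all hypothetical, and the confidence radius `beta` is held at a fixed constant here, whereas the paper uses a data-dependent radius.

```python
from collections import deque

import numpy as np


class SlidingWindowLinUCB:
    """Sketch of a sliding-window UCB learner for linear bandits (illustrative only)."""

    def __init__(self, dim, window, lam=1.0, beta=1.0):
        self.dim = dim
        self.lam = lam    # ridge regularization strength (assumed value)
        self.beta = beta  # confidence radius; a fixed constant is a simplification
        # The "forgetting principle": only the last `window` observations are kept.
        self.history = deque(maxlen=window)

    def estimate(self):
        """Regularized least squares over the window; returns (theta_hat, V_inv)."""
        V = self.lam * np.eye(self.dim)
        b = np.zeros(self.dim)
        for x, r in self.history:
            V += np.outer(x, x)
            b += r * x
        V_inv = np.linalg.inv(V)
        return V_inv @ b, V_inv

    def select(self, actions):
        """Return the index of the action with the largest upper confidence bound."""
        theta_hat, V_inv = self.estimate()
        # Exploration bonus ||x||_{V^{-1}} for every row x of `actions`.
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
        return int(np.argmax(actions @ theta_hat + self.beta * bonus))

    def update(self, x, reward):
        self.history.append((np.asarray(x, dtype=float), float(reward)))


def run_demo(T=500, window=50, seed=0):
    """Toy environment: the rewarding arm switches abruptly halfway through."""
    rng = np.random.default_rng(seed)
    actions = np.eye(2)
    learner = SlidingWindowLinUCB(dim=2, window=window)
    pulls = []
    for t in range(T):
        theta = np.array([1.0, 0.0]) if t < T // 2 else np.array([0.0, 1.0])
        i = learner.select(actions)
        reward = actions[i] @ theta + 0.1 * rng.standard_normal()
        learner.update(actions[i], reward)
        pulls.append(i)
    return pulls
```

Calling `run_demo()` returns the sequence of chosen arm indices. Because stale observations fall out of the window, the estimate for the formerly good arm decays after the switch and the learner moves to the other arm. A learner that averaged over all past data would react more slowly.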

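Experimental Results

Research Questions

RQ1: When the variation budget B_T is known, what dynamic regret bounds can be achieved for drifting linear bandits?
RQ2: How does the dynamic regret change when B_T is unknown, and can an adaptive framework achieve near-optimal performance without knowing B_T?
RQ3: Can the SW-UCB framework be extended to bandit settings beyond linear bandits (MAB, GLM, combinatorial semi-bandits)?
RQ4: Does combining the forgetting principle with adaptive windowing improve performance in non-stationary environments?
RQ5: How does the proposed method perform empirically against existing non-stationary bandit algorithms on synthetic and real-world datasets?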

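Key Findings

When B_T is known, SW-UCB with a tuned window size achieves near-optimal dynamic regret (up to logarithmic factors).
The BOB framework adaptively tunes the SW-UCB window size and achieves near-optimal dynamic regret when B_T is unknown, improving on prior methods.
Incorporating the forgetting principle into optimism-based learning handles non-stationary environments more effectively, with provable regret guarantees.
The extensions to MAB, generalized linear bandits, and combinatorial semi-bandits broaden applicability to operations research problems.
Extensive experiments on synthetic data and the online auto-loan dataset show superior empirical performance relative to existing algorithms.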
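
The BOB finding above rests on a two-level structure: an adversarial bandit learner picks a window size for each block of rounds, and a freshly restarted sliding-window UCB learner plays that block. The sketch below illustrates the idea on a K-armed Bernoulli bandit, using EXP3 as the outer learner. The function names, the block length, the candidate window set, the exploration rate `gamma`, and the `drifting_means` environment are all assumptions made for illustration; the paper's own parameter choices are not reproduced here.

```python
import math
from collections import deque

import numpy as np


def sw_ucb_block(means_fn, start, length, window, rng):
    """Play one block with a freshly restarted sliding-window UCB; return total reward."""
    k = len(means_fn(start))
    hist = deque(maxlen=window)  # (arm, reward) pairs inside the window
    total = 0.0
    for t in range(start, start + length):
        counts, sums = np.zeros(k), np.zeros(k)
        for arm, r in hist:
            counts[arm] += 1
            sums[arm] += r
        safe = np.maximum(counts, 1)
        ucb = sums / safe + np.sqrt(2 * math.log(window + 1) / safe)
        ucb[counts == 0] = np.inf  # arms unseen in the window are tried first
        arm = int(np.argmax(ucb))
        reward = float(rng.random() < means_fn(t)[arm])  # Bernoulli reward
        hist.append((arm, reward))
        total += reward
    return total


def bob(means_fn, horizon, block_len, windows, seed=0, gamma=0.1):
    """Outer EXP3 over candidate window sizes, inner SW-UCB restarted per block."""
    rng = np.random.default_rng(seed)
    weights = np.ones(len(windows))
    total, chosen = 0.0, []
    for start in range(0, horizon, block_len):
        length = min(block_len, horizon - start)
        probs = (1 - gamma) * weights / weights.sum() + gamma / len(windows)
        j = int(rng.choice(len(windows), p=probs))
        block_reward = sw_ucb_block(means_fn, start, length, windows[j], rng)
        # EXP3 update with the block reward normalized to [0, 1] and importance-weighted.
        x_hat = (block_reward / length) / probs[j]
        weights[j] *= math.exp(gamma * x_hat / len(windows))
        weights /= weights.max()  # rescale to avoid overflow; probabilities unchanged
        total += block_reward
        chosen.append(windows[j])
    return total, chosen


def drifting_means(t):
    """Hypothetical environment: the better arm swaps every 200 rounds."""
    return (0.8, 0.2) if (t // 200) % 2 == 0 else (0.2, 0.8)
```

A call such as `bob(drifting_means, horizon=2000, block_len=100, windows=[1, 10, 100])` returns the cumulative reward and the window size chosen in each block. No knowledge of how fast the environment drifts is passed in, which is the sense in which the summary describes the approach as working when B_T is unknown.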

This review was generated by AI and reviewed by human editors.