QUICK REVIEW

[Paper Review] Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit

Yang Cao, Zheng Wen|arXiv (Cornell University)|Feb 11, 2018

Advanced Bandit Algorithms Research | Cited by 48

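One-Sentence Summary

M-UCB combines uniform exploration, UCB1, and a simple sliding-window change-point detector to handle piecewise-stationary bandit problems, achieving an O(sqrt(MKT log T)) regret bound that is nearly optimal up to a logarithmic factor.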

ABSTRACT

Multi-armed bandit (MAB) is a class of online learning problems where a learning agent aims to maximize its expected cumulative reward while repeatedly selecting to pull arms with unknown reward distributions. We consider a scenario where the reward distributions may change in a piecewise-stationary fashion at unknown time steps. We show that by incorporating a simple change-detection component with classic UCB algorithms to detect and adapt to changes, our so-called M-UCB algorithm can achieve nearly optimal regret bound on the order of $O(\sqrt{MKT\log T})$, where $T$ is the number of time steps, $K$ is the number of arms, and $M$ is the number of stationary segments. Comparison with the best available lower bound shows that our M-UCB is nearly optimal in $T$ up to a logarithmic factor. We also compare M-UCB with the state-of-the-art algorithms in numerical experiments using a public Yahoo! dataset to demonstrate its superior performance.

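Motivation and Objectives

Motivate the study of bandit problems with piecewise-stationary reward distributions through real-world applications.
Propose a practical algorithm (M-UCB) that combines change-point detection with UCB to adapt to changes.
Establish a nearly optimal regret bound for M-UCB under mild assumptions.
Demonstrate the empirical advantages of M-UCB on synthetic data and on a Yahoo! dataset benchmark.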

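Proposed Method

Introduce a simple change-point detector that compares the means of two sliding-window halves (Algorithm 1).
Embed the detector in UCB-style learning to form Monitored-UCB (M-UCB, Algorithm 2).
Combine uniform sampling with UCB-based selection so that every arm is explored often enough for changes on any arm to be detected.
Provide a theoretical regret analysis showing R(T) = O(sqrt(MKT log T)) under Assumption 1.
Decompose the regret into four components: exploration cost, uniform-sampling cost, detection delay, and false alarms (Theorem 1).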
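
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the function names `change_detected`, `m_ucb`, and `step_env` are hypothetical, the UCB index uses the standard UCB1 bonus sqrt(2 log(t - tau) / n_k), the forced-exploration schedule (the first K steps of every floor(K / gamma) steps since the last restart) is one plausible reading of the uniform-sampling step, and the noise-free two-arm environment exists only to make the behavior easy to follow.

```python
import math
from collections import deque


def change_detected(window, b):
    """Return True if the sums of the two window halves differ by more than b."""
    obs = list(window)
    half = len(obs) // 2
    return abs(sum(obs[half:]) - sum(obs[:half])) > b


def m_ucb(pull, K, T, w, b, gamma):
    """Monitored-UCB style loop; pull(arm, t) must return a reward in [0, 1].

    Assumes w is even and 0 < gamma <= 1. Returns (choices, alarms).
    """
    period = math.floor(K / gamma)   # every `period` steps begin with K forced pulls
    tau = 0                          # time of the most recent restart
    counts = [0] * K                 # pulls per arm since the last restart
    sums = [0.0] * K                 # reward totals per arm since the last restart
    windows = [deque(maxlen=w) for _ in range(K)]
    choices, alarms = [], []
    for t in range(1, T + 1):
        slot = (t - tau - 1) % period
        if slot < K:
            arm = slot               # uniform sampling: visit each arm in turn
        else:
            # UCB1 index computed only from data gathered since the last restart
            arm = max(range(K), key=lambda k: sums[k] / counts[k]
                      + math.sqrt(2 * math.log(t - tau) / counts[k]))
        reward = pull(arm, t)
        counts[arm] += 1
        sums[arm] += reward
        windows[arm].append(reward)
        choices.append(arm)
        # Run the detector only once the pulled arm has a full window
        if len(windows[arm]) == w and change_detected(windows[arm], b):
            alarms.append(t)
            tau = t                  # restart: discard all statistics
            counts = [0] * K
            sums = [0.0] * K
            for win in windows:
                win.clear()
    return choices, alarms


def step_env(arm, t):
    """Hypothetical noise-free environment: arm 0 degrades after t = 1000."""
    if arm == 0:
        return 0.9 if t <= 1000 else 0.1
    return 0.5


if __name__ == "__main__":
    choices, alarms = m_ucb(step_env, K=2, T=2000, w=20, b=3.0, gamma=0.1)
    print("restarts at:", alarms)
```

Because the first K steps after every restart are forced pulls, each arm has at least one observation before the UCB index is evaluated, so the division by `counts[k]` is safe. The restart discards all statistics rather than only those of the arm that raised the alarm, which matches the "restart" behavior described in this review.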

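Experimental Results

Research Questions

RQ1: Can a simple change-point detector combined with a UCB method achieve strong regret guarantees in piecewise-stationary bandit problems?
RQ2: How does the regret of such a method scale with the time horizon T, the number of arms K, and the number of stationary segments M?
RQ3: How do the proposed parameters (window w, threshold b, uniform-sampling rate gamma) affect detection and regret?
RQ4: On real-world data, how does M-UCB perform empirically against state-of-the-art non-stationary bandit algorithms?
RQ5: How robust are the theoretical bounds to departures from the assumptions (for example, non-Bernoulli rewards or small changes)?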

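Key Findings

Under mild technical assumptions, M-UCB achieves a regret upper bound of O(sqrt(MKT log T)), which nearly matches the known lower bound up to a logarithmic factor.
Regret scales roughly with the square root of the number of segments M and the square root of the number of arms K, a scaling supported by the experiments.
A simple sliding-window change detector is sufficient to trigger a restart and guide learning after a change is detected.
On the Yahoo! data, M-UCB reduces cumulative regret by at least 50-60% relative to state-of-the-art baselines (such as EXP3, EXP3.S, SW-UCB, D-UCB, and SHIFTBAND).
Experiments on the Yahoo! and synthetic data indicate robustness to changes without requiring strong parametric assumptions.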
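
To make "nearly optimal in T up to a logarithmic factor" concrete, the upper bound can be set against a lower bound of order sqrt(T). Treating sqrt(T) as the benchmark rate is an assumption here, since this review refers only to "the best available lower bound" without stating it.

```latex
\[
\frac{\sqrt{MKT\log T}}{\sqrt{T}} \;=\; \sqrt{MK\log T}
\]
```

Under that assumption, for fixed M and K the gap between the upper and lower bounds grows only as sqrt(log T), which is the logarithmic factor mentioned in the abstract.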

This review was generated by AI and checked by a human editor.