QUICK REVIEW

[Paper Review] A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free

Yifang Chen, Chung-Wei Lee | arXiv (Cornell University) | Feb 3, 2019

Advanced Bandit Algorithms Research | References: 27 | Cited by: 39

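One-Sentence Summary

This paper proposes the first contextual bandit algorithm for non-stationary environments that is parameter-free, efficient, and optimal: using replay phases, it achieves a dynamic regret bound of O(min{√(KST), K^{1/3} Δ^{1/3} T^{2/3}}) without prior knowledge of S or Δ.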

ABSTRACT

We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. Specifically, our algorithm achieves dynamic regret $\mathcal{O}(\min\{\sqrt{ST}, Δ^{\frac{1}{3}}T^{\frac{2}{3}}\})$ for a contextual bandit problem with $T$ rounds, $S$ switches and $Δ$ total variation in data distributions. Importantly, our algorithm is adaptive and does not need to know $S$ or $Δ$ ahead of time, and can be implemented efficiently assuming access to an ERM oracle. Our results strictly improve the $\mathcal{O}(\min \{S^{\frac{1}{4}}T^{\frac{3}{4}}, Δ^{\frac{1}{5}}T^{\frac{4}{5}}\})$ bound of (Luo et al., 2018), and greatly generalize and improve the $\mathcal{O}(\sqrt{ST})$ result of (Auer et al, 2018) that holds only for the two-armed bandit problem without contextual information. The key novelty of our algorithm is to introduce replay phases, in which the algorithm acts according to its previous decisions for a certain amount of time in order to detect non-stationarity while maintaining a good balance between exploration and exploitation.
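
The abstract refers to $S$ switches and total variation $Δ$ without spelling them out. One common way to formalize these quantities, together with dynamic regret, is sketched below. The symbols $\mathcal{D}_t$ (distribution over context-reward pairs at round $t$), $\Pi$ (policy class), $\pi_t^\star$ (best policy for round $t$), and $a_t$ (action played) are notation assumed here for illustration, and details such as whether $S$ counts the initial segment may differ from the paper's conventions.

```latex
\begin{align*}
S &= 1 + \sum_{t=2}^{T} \mathbf{1}\{\mathcal{D}_t \neq \mathcal{D}_{t-1}\},
&
\Delta &= \sum_{t=2}^{T} \lVert \mathcal{D}_t - \mathcal{D}_{t-1} \rVert_{\mathrm{TV}},
\\
\pi_t^\star &= \operatorname*{arg\,max}_{\pi \in \Pi} \; \mathbb{E}_{(x, r) \sim \mathcal{D}_t}\big[ r(\pi(x)) \big],
&
\text{Dynamic regret} &= \sum_{t=1}^{T} \Big( r_t(\pi_t^\star(x_t)) - r_t(a_t) \Big).
\end{align*}
```

Under this reading, the benchmark policy $\pi_t^\star$ may change every round, which is what separates dynamic regret from regret against a single fixed policy.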

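Motivation and Objectives

Address non-stationary environments in which no single policy remains optimal over time.
Propose a parameter-free contextual bandit algorithm with dynamic regret guarantees.
Achieve adaptive performance when the number of switches and the amount of variation in the environment are unknown.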

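Proposed Method

Introduce replay phases, during which the algorithm acts according to its earlier decisions in order to detect non-stationarity.
Develop an online learning framework that balances exploration and exploitation across both replay phases and regular phases.
Prove a dynamic regret bound of O(min{√(KST), K^{1/3} Δ^{1/3} T^{2/3}}) for T rounds, K actions, S switches, and total variation Δ.
Assume access to an ERM (empirical risk minimization) oracle so that the algorithm can be implemented efficiently.
The algorithm is parameter-free and adapts to S and Δ.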
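
The intuition behind a replay phase can be illustrated with a deliberately simplified toy: replay one earlier action and check whether its average reward has drifted by more than sampling noise would explain. This is not the paper's procedure, which works with distributions over policies and an ERM oracle. The function names, the Hoeffding-style threshold, and the Bernoulli reward model below are all assumptions made for illustration.

```python
import math
import random


def hoeffding_width(n, fail_prob=0.05):
    """Deviation bound for the mean of n rewards in [0, 1], valid with prob. >= 1 - fail_prob."""
    return math.sqrt(math.log(2 / fail_prob) / (2 * n))


def replay_detects_change(old_rewards, replay_rewards, fail_prob=0.05):
    """Return True if replaying an earlier decision yields a significantly different mean reward."""
    old_mean = sum(old_rewards) / len(old_rewards)
    replay_mean = sum(replay_rewards) / len(replay_rewards)
    # Allow for sampling noise in both estimates before declaring a change.
    slack = (hoeffding_width(len(old_rewards), fail_prob)
             + hoeffding_width(len(replay_rewards), fail_prob))
    return abs(replay_mean - old_mean) > slack


# Hypothetical environment: one action whose Bernoulli reward mean drops from 0.8 to 0.2.
rng = random.Random(0)


def pull(mean_reward, n):
    """Draw n Bernoulli rewards with the given mean (toy reward model)."""
    return [1 if rng.random() < mean_reward else 0 for _ in range(n)]


old = pull(0.8, 200)     # rewards observed when the action was first played
replay = pull(0.2, 200)  # the same action replayed after a hypothetical switch
print("change detected:", replay_detects_change(old, replay))
```

The point of the toy is that acting on past decisions gives a like-for-like comparison: any gap beyond the noise allowance is evidence that the environment, rather than the learner, has changed.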

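Results

Research Questions

RQ1: How can non-stationarity in contextual bandits be detected and handled efficiently without knowing S and Δ in advance?
RQ2: What dynamic regret guarantees are achievable under non-stationarity for contextual bandits with an ERM oracle?
RQ3: Can a replay mechanism achieve optimal or near-optimal performance in the contextual setting without sacrificing efficiency?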

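Key Findings

Achieves a dynamic regret bound of O(min{√(KST), K^{1/3} Δ^{1/3} T^{2/3}}).
The algorithm is parameter-free and adapts to unknown S and Δ.
Replay phases enable detection of non-stationarity while preserving the exploration-exploitation balance.
The result improves on earlier bounds: it generalizes the O(√(ST)) result of Auer et al. (2018), which holds only for the two-armed bandit without context, and improves the O(min{S^{1/4} T^{3/4}, Δ^{1/5} T^{4/5}}) bound of Luo et al. (2018).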
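
To get a feel for the size of the improvement, the two bounds can be evaluated numerically. The sketch below drops all constants and logarithmic factors, uses the earlier bound exactly as quoted above (so without any dependence on K), and picks arbitrary example values for T, K, S, and Δ. Treat it as an order-of-magnitude comparison only.

```python
import math


def new_bound(T, K, S, delta):
    """min{sqrt(K*S*T), K^(1/3) * delta^(1/3) * T^(2/3)}, constants and log factors dropped."""
    switching_term = math.sqrt(K * S * T)
    variation_term = (K * delta) ** (1 / 3) * T ** (2 / 3)
    return min(switching_term, variation_term)


def earlier_bound(T, S, delta):
    """min{S^(1/4) * T^(3/4), delta^(1/5) * T^(4/5)}, as quoted for Luo et al. (2018)."""
    switching_term = S ** 0.25 * T ** 0.75
    variation_term = delta ** 0.2 * T ** 0.8
    return min(switching_term, variation_term)


# Arbitrary illustrative values, not taken from the paper.
T, K, S, delta = 1_000_000, 2, 10, 5.0
print(f"new bound     ~ {new_bound(T, K, S, delta):,.0f}")
print(f"earlier bound ~ {earlier_bound(T, S, delta):,.0f}")
```

For these values the switching term √(KST) is the smaller of the two new terms, and it sits roughly an order of magnitude below the earlier bound.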


This review was generated by AI and reviewed by a human editor.