QUICK REVIEW

[Paper Review] Weighted Linear Bandits for Non-Stationary Environments

Yoan Russac, Claire Vernade|arXiv (Cornell University)|Sep 19, 2019

Advanced Bandit Algorithms Research | References: 1 | Cited by: 56

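One-sentence summary

This paper proposes D-LinUCB, a discount-based linear bandit algorithm for non-stationary environments, with new deviation bounds for weighted least squares and a dynamic regret of order d^{2/3} B_T^{1/3} T^{2/3}; it adapts to both slowly varying and abruptly changing parameters.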

ABSTRACT

We consider a stochastic linear bandit model in which the available actions correspond to arbitrary context vectors whose associated rewards follow a non-stationary linear regression model. In this setting, the unknown regression parameter is allowed to vary in time. To address this problem, we propose D-LinUCB, a novel optimistic algorithm based on discounted linear regression, where exponential weights are used to smoothly forget the past. This involves studying the deviations of the sequential weighted least-squares estimator under generic assumptions. As a by-product, we obtain novel deviation results that can be used beyond non-stationary environments. We provide theoretical guarantees on the behavior of D-LinUCB in both slowly-varying and abruptly-changing environments. We obtain an upper bound on the dynamic regret that is of order d^{2/3} B_T^{1/3} T^{2/3}, where B_T is a measure of non-stationarity (d and T being, respectively, dimension and horizon). This rate is known to be optimal. We also illustrate the empirical performance of D-LinUCB and compare it with recently proposed alternatives in simulated environments.
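
For readers who prefer symbols, the setting described in the abstract can be written out as follows. This is a sketch under assumed notation: the symbols A_t (chosen action), \mathcal{A}_t (available action set), \theta^\star_t (time-varying parameter), \eta_t (noise) and R_T (dynamic regret) are introduced here for illustration, and the variation budget is stated in one common form (total Euclidean drift), which may differ in detail from the paper's exact definition.

```latex
% Reward observed at round t after playing action A_t (assumed notation):
X_t = \langle A_t, \theta^\star_t \rangle + \eta_t

% B_T measures non-stationarity as a bound on the total drift of the parameter:
\sum_{t=1}^{T-1} \lVert \theta^\star_{t+1} - \theta^\star_t \rVert_2 \le B_T

% Dynamic regret: compare against the best available action at every round:
R_T = \sum_{t=1}^{T} \max_{a \in \mathcal{A}_t} \langle a - A_t, \theta^\star_t \rangle
```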

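Motivation and objectives

Motivated by non-stationary rewards in linear bandits, such as evolving user preferences.
Extend deviation inequalities to sequential weighted least squares with discounting.
Develop a fully recursive, adaptive algorithm that handles both slowly varying and abruptly changing parameters.
Provide theoretical regret guarantees for the proposed algorithm under non-stationarity.
Demonstrate empirical performance against competing methods in simulated and real-data-inspired scenarios.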

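Proposed method

Introduces D-LinUCB, an optimistic algorithm based on weighted linear regression with discounted forgetting.
Defines the weighted regularized least-squares estimator and its confidence ellipsoid using weights w_t and regularization lambda_t, with mu_t chosen proportional to lambda_t^2 for scale invariance.
Proves a maximal deviation inequality for the weighted estimator in terms of V_t and \tilde{V}_t, highlighting the role of the squared weights in the variance term.
Uses discount weights w_t = gamma^{-t} and increasing regularization lambda_t = gamma^{-t} lambda, which yields stable confidence bounds and recursive update rules.
Gives a unified regret analysis covering both abrupt and slowly varying environments, including a bias-variance decomposition and a time-horizon parameter D.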
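
The recursive structure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' reference implementation: the class name `DLinUCBSketch`, its method names, and the default constants are hypothetical, and the exact forms of the update rules and of the confidence radius are assumptions of this sketch (the weights are rescaled so that the most recent observation has weight 1, which keeps the matrices bounded). Here `V` accumulates the weighted design matrix, `V_tilde` accumulates the squared weights, and `S`, `L`, `sigma` stand for assumed bounds on the parameter norm, the action norm, and the noise scale.

```python
import numpy as np


class DLinUCBSketch:
    """Illustrative discounted LinUCB; names and constants are assumptions."""

    def __init__(self, d, gamma, lam=1.0, sigma=1.0, S=1.0, L=1.0, delta=0.01):
        if not 0.0 < gamma < 1.0:
            raise ValueError("gamma must lie strictly between 0 and 1")
        self.d, self.gamma, self.lam = d, gamma, lam
        self.sigma, self.S, self.L, self.delta = sigma, S, L, delta
        self.t = 0                          # number of updates so far
        self.V = lam * np.eye(d)            # weighted design matrix
        self.V_tilde = lam * np.eye(d)      # design matrix with squared weights
        self.b = np.zeros(d)                # weighted sum of reward * action
        self.theta_hat = np.zeros(d)        # current estimate

    def _beta(self):
        # Confidence radius (assumed form): bias term plus a self-normalized
        # noise term driven by the sum of squared discount weights.
        g2 = self.gamma ** 2
        weight_sum = (1.0 - g2 ** self.t) / (1.0 - g2)
        log_term = self.d * np.log1p(self.L ** 2 * weight_sum / (self.lam * self.d))
        noise = self.sigma * np.sqrt(2.0 * np.log(1.0 / self.delta) + log_term)
        return np.sqrt(self.lam) * self.S + noise

    def select(self, actions):
        # Optimistic choice: estimated reward plus an exploration bonus
        # measured in the norm induced by V^{-1} V_tilde V^{-1}.
        actions = np.asarray(actions, dtype=float)
        V_inv = np.linalg.inv(self.V)
        M = V_inv @ self.V_tilde @ V_inv
        widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, M, actions))
        ucb = actions @ self.theta_hat + self._beta() * widths
        return int(np.argmax(ucb))

    def update(self, action, reward):
        # Discount the past by gamma (gamma^2 for the squared-weight matrix),
        # then add the new observation and top the regularization back up.
        a = np.asarray(action, dtype=float)
        g, g2, eye = self.gamma, self.gamma ** 2, np.eye(self.d)
        self.V = g * self.V + np.outer(a, a) + (1.0 - g) * self.lam * eye
        self.V_tilde = g2 * self.V_tilde + np.outer(a, a) + (1.0 - g2) * self.lam * eye
        self.b = g * self.b + reward * a
        self.theta_hat = np.linalg.solve(self.V, self.b)
        self.t += 1
```

A typical loop would call `select` on the current action set, observe the reward of the chosen action, and pass both to `update`. Each round costs one d-by-d inverse and solve, so no history of past observations needs to be stored.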

Experimental results

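Research questions

RQ1: How can discounted weighted least squares be analyzed in a sequential, non-stationary linear bandit setting?
RQ2: Can an optimistic linear bandit algorithm with exponential forgetting achieve a meaningful dynamic regret upper bound under varying degrees of non-stationarity?
RQ3: What are the theoretical guarantees (deviation bounds and regret) of D-LinUCB in slowly varying and abruptly changing environments?
RQ4: How does the method perform empirically against sliding-window and change-point-detection approaches in high- and low-dimensional settings?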

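Key findings

The paper gives a maximal deviation inequality for the sequential weighted least-squares estimator with general weights and regularization.
D-LinUCB is fully recursive, has a computational cost comparable to LinUCB, and uses discounting to adapt to non-stationarity.
In non-stationary environments, the regret of D-LinUCB is upper bounded by O(d^{2/3} B_T^{1/3} T^{2/3}).
Corollary: when gamma is set as a function of the horizon T and the variation budget B_T, the regret is, with high probability, asymptotically O(d^{2/3} B_T^{1/3} T^{2/3}), matching the known lower bound in its dependence on d, B_T, and T.
Experiments show that D-LinUCB and SW-LinUCB adapt well to both abrupt changes and slow drift, and outperform the non-adaptive LinUCB in non-stationary scenarios.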
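
To make the tuning of gamma concrete, the sketch below computes a discount factor from d, T, and B_T and evaluates the stated regret rate. The helper names `suggested_discount` and `regret_rate` are hypothetical, and the specific choice gamma = 1 - (B_T / (d T))^{2/3} is an assumption of this sketch, picked so that the effective memory 1 / (1 - gamma) grows as (d T / B_T)^{2/3}; it should be checked against the paper's corollary before being relied on.

```python
def suggested_discount(d, T, B_T):
    """Discount factor as a function of dimension, horizon, and variation budget.

    Assumed form: gamma = 1 - (B_T / (d * T)) ** (2/3).
    """
    if d <= 0 or T <= 0 or B_T <= 0:
        raise ValueError("d, T and B_T must be positive")
    ratio = B_T / (d * T)
    if ratio >= 1.0:
        raise ValueError("B_T must be smaller than d * T for gamma to be positive")
    return 1.0 - ratio ** (2.0 / 3.0)


def regret_rate(d, T, B_T):
    """Order of the dynamic regret bound, ignoring constants and log factors."""
    return d ** (2.0 / 3.0) * B_T ** (1.0 / 3.0) * T ** (2.0 / 3.0)


if __name__ == "__main__":
    d, T, B_T = 2, 10_000, 5.0
    gamma = suggested_discount(d, T, B_T)
    print(f"gamma = {gamma:.5f}, effective memory = {1.0 / (1.0 - gamma):.1f} rounds")
    print(f"regret rate = {regret_rate(d, T, B_T):.1f}")
```

A larger variation budget B_T pushes gamma further below 1, so the algorithm forgets faster; a nearly stationary problem gives gamma close to 1 and a long memory.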

This review was generated by AI and reviewed by human editors.