QUICK REVIEW

[Paper Review] Almost Optimal Algorithms for Linear Stochastic Bandits with Heavy-Tailed Payoffs

Han Shao, Xiaotian Yu | arXiv (Cornell University) | Jan 1, 2018

Advanced Bandit Algorithms Research | Cited by 6

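One-Sentence Summary

This paper proposes two new algorithms for linear stochastic bandits with heavy-tailed payoffs, where the payoffs have finite moments of order $1 + \epsilon$ for some $\epsilon \in (0,1]$. By adopting median of means with a well-designed allocation of decisions, and truncation based on historical information, the algorithms achieve regret upper bounds that match the $\Omega(T^{1/(1+\epsilon)})$ lower bound up to polylogarithmic factors, which makes them optimal in the polynomial order of $T$.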

ABSTRACT

In linear stochastic bandits, it is commonly assumed that payoffs are with sub-Gaussian noises. In this paper, under a weaker assumption on noises, we study the problem of linear stochastic bandits with heavy-tailed payoffs (LinBET), where the distributions have finite moments of order $1+\epsilon$, for some $\epsilon\in (0,1]$. We rigorously analyze the regret lower bound of LinBET as $\Omega(T^{\frac{1}{1+\epsilon}})$, implying that finite moments of order 2 (i.e., finite variances) yield the bound of $\Omega(\sqrt{T})$, with $T$ being the total number of rounds to play bandits. The provided lower bound also indicates that the state-of-the-art algorithms for LinBET are far from optimal. By adopting median of means with a well-designed allocation of decisions and truncation based on historical information, we develop two novel bandit algorithms, where the regret upper bounds match the lower bound up to polylogarithmic factors. To the best of our knowledge, we are the first to solve LinBET optimally in the sense of the polynomial order on $T$. Our proposed algorithms are evaluated based on synthetic datasets, and outperform the state-of-the-art results.
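As a quick check of the exponents quoted in the abstract, the lower bound can be evaluated at specific values of $\epsilon$. The value $\epsilon = 1/2$ is chosen here only for illustration and does not come from the abstract.

$$
\Omega\!\left(T^{\frac{1}{1+\epsilon}}\right)\Big|_{\epsilon = 1} = \Omega\!\left(T^{1/2}\right) = \Omega(\sqrt{T}),
\qquad
\Omega\!\left(T^{\frac{1}{1+\epsilon}}\right)\Big|_{\epsilon = 1/2} = \Omega\!\left(T^{2/3}\right).
$$

The exponent $\frac{1}{1+\epsilon}$ grows toward 1 as $\epsilon$ decreases toward 0, so heavier tails (fewer finite moments) push the lower bound closer to linear in $T$.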

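Motivation and Objectives

To close the optimality gap for linear stochastic bandits when the payoff distributions are heavy-tailed, specifically when they have finite moments of order $1 + \epsilon$ for some $\epsilon \in (0,1]$.
To establish a tight regret lower bound of $\Omega(T^{1/(1+\epsilon)})$ for this setting, showing that existing algorithms are suboptimal.
To design new bandit algorithms whose regret upper bounds match this lower bound up to polylogarithmic factors.
To validate the proposed algorithms in synthetic experiments and show that they outperform the current state of the art.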

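Proposed Methods

Uses a median-of-means estimator to estimate payoff means robustly under heavy-tailed noise, reducing sensitivity to extreme values.
Proposes a decision allocation strategy that dynamically prioritizes actions according to uncertainty and historical performance in order to minimize regret.
Applies a data-based truncation mechanism that adapts to the magnitude of observed payoffs, improving robustness without prior knowledge of the tail behavior.
Combines median-of-means estimation with truncated empirical mean estimation to obtain stable and accurate payoff estimates under the weak moment assumption.
Designs confidence intervals that account for the sub-Weibull behavior of the noise, ensuring high-probability concentration under moments of order $1+\epsilon$.
Carries out the regret analysis under the new estimation framework by decomposing the regret into contributions from estimation error, sampling bias, and variance.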
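The two robust estimators named above can be illustrated on plain scalar samples. The following is a minimal sketch, not the paper's algorithms: the paper applies these ideas to estimates in a linear model and chooses the truncation level from historical information, whereas the function names, the fixed `threshold`, and the sample values here are hypothetical and chosen only for illustration.

```python
from statistics import fmean, median


def median_of_means(samples, num_groups):
    """Split samples into contiguous groups, average each group,
    and return the median of the group averages."""
    samples = list(samples)
    n = len(samples)
    if not 1 <= num_groups <= n:
        raise ValueError("num_groups must be between 1 and len(samples)")
    # Group boundaries; group sizes differ by at most one.
    bounds = [i * n // num_groups for i in range(num_groups + 1)]
    group_means = [
        fmean(samples[bounds[i]:bounds[i + 1]]) for i in range(num_groups)
    ]
    return median(group_means)


def truncated_mean(samples, threshold):
    """Average the samples after replacing any sample whose magnitude
    exceeds `threshold` with zero (fixed threshold: an assumption here)."""
    samples = list(samples)
    return fmean([x if abs(x) <= threshold else 0.0 for x in samples])


# Hypothetical payoffs with a single extreme observation.
payoffs = [1.0, 1.2, 0.8, 1.1, 0.9, 100.0]
print(fmean(payoffs))                            # plain empirical mean
print(median_of_means(payoffs, num_groups=3))    # median of group means
print(truncated_mean(payoffs, threshold=10.0))   # truncated empirical mean
```

In this toy example the single outlier pulls the plain mean far from the typical payoff, while both robust estimates stay close to 1. The number of groups and the threshold both trade bias against robustness, so in practice they would be tuned to the moment assumption rather than fixed by hand.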

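Experimental Results

Research Questions

RQ1: What is the fundamental limit (the regret lower bound) for linear stochastic bandits when the payoff distributions have finite moments only of order $1 + \epsilon$?
RQ2: How can bandit algorithms be designed to be robust to heavy-tailed payoffs while still achieving near-optimal regret under the weak moment assumption?
RQ3: How far are existing state-of-the-art LinBET algorithms from optimal under the moment condition of order $1+\epsilon$?
RQ4: Can median-of-means estimation, combined with adaptive truncation and allocation, achieve regret bounds that match the information-theoretic lower bound?
RQ5: How do the proposed methods compare with prior methods in cumulative regret on synthetic heavy-tailed data?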

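Key Findings

The paper establishes a regret lower bound of $\Omega(T^{1/(1+\epsilon)})$ for linear stochastic bandits with heavy-tailed payoffs (finite moments of order $1+\epsilon$).
The proposed algorithms achieve a regret upper bound of $\widetilde{O}(T^{1/(1+\epsilon)})$, which matches the lower bound up to polylogarithmic factors and is therefore optimal in the polynomial order of $T$.
In the special case $\epsilon = 1$ (finite variance), the regret bound reduces to $\widetilde{O}(\sqrt{T})$, consistent with the known sub-Gaussian results.
Compared with the standard empirical mean, the median-of-means estimator is substantially more robust under heavy-tailed noise.
The adaptive truncation mechanism based on historical data improves performance by filtering out extreme observations without relying on prior knowledge of the tail parameters.
Empirical evaluation on synthetic datasets shows that the proposed algorithms outperform existing state-of-the-art methods in cumulative regret.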
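To make the setting behind these findings concrete, the sketch below generates linear payoffs with heavy-tailed noise whose moments of order $1+\epsilon$ are finite. This review does not state which distributions the synthetic experiments used, so the centered Pareto noise, the parameter vector `theta`, the arm features, and the value of `epsilon` are all assumptions made for illustration.

```python
import random


def centered_pareto_noise(rng, shape):
    """Zero-mean noise whose moments are finite only for orders below `shape`."""
    if shape <= 1:
        raise ValueError("shape must exceed 1 so that the mean exists")
    # Mean of a Pareto variable with minimum value 1 and tail index `shape`.
    mean = shape / (shape - 1.0)
    return rng.paretovariate(shape) - mean


def linear_payoff(arm, theta, rng, shape):
    """Payoff <arm, theta> plus heavy-tailed, zero-mean noise."""
    expected = sum(a * t for a, t in zip(arm, theta))
    return expected + centered_pareto_noise(rng, shape)


rng = random.Random(0)          # fixed seed for reproducibility
theta = [0.5, -0.2]             # hypothetical unknown parameter
arm = [1.0, 1.0]                # hypothetical arm feature vector
epsilon = 0.5                   # illustrative value in (0, 1]
# Choosing shape > 1 + epsilon keeps the moment of order 1 + epsilon finite,
# while the variance is infinite because shape < 2.
shape = 1.0 + epsilon + 0.1
samples = [linear_payoff(arm, theta, rng, shape) for _ in range(5)]
print(samples)
```

With this choice of `shape` the noise has no finite variance, which is the regime where the plain empirical mean concentrates poorly and robust estimators become necessary.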


This review was generated by AI and reviewed by human editors.