QUICK REVIEW

[Paper Review] Data Poisoning Attacks on Stochastic Bandits

Fang Liu, Ness B. Shroff | arXiv (Cornell University) | May 16, 2019

Advanced Bandit Algorithms Research | Cited by 30

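One-Sentence Summary

This paper proposes a new data poisoning attack framework against stochastic multi-armed bandits, covering both offline and online attack strategies. The results show that an attacker can force a bandit algorithm to pull a target arm with high probability using only slight manipulation of the rewards. In the online setting the victim suffers linear regret while the attacker pays only a logarithmic cost, even without knowing which bandit algorithm the victim runs.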

ABSTRACT

Stochastic multi-armed bandits form a class of online learning problems that have important applications in online recommendation systems, adaptive medical treatment, and many others. Even though potential attacks against these learning algorithms may hijack their behavior, causing catastrophic loss in real-world applications, little is known about adversarial attacks on bandit algorithms. In this paper, we propose a framework of offline attacks on bandit algorithms and study convex optimization based attacks on several popular bandit algorithms. We show that the attacker can force the bandit algorithm to pull a target arm with high probability by a slight manipulation of the rewards in the data. Then we study a form of online attacks on bandit algorithms and propose an adaptive attack strategy against any bandit algorithm without the knowledge of the bandit algorithm. Our adaptive attack strategy can hijack the behavior of the bandit algorithm to suffer a linear regret with only a logarithmic cost to the attacker. Our results demonstrate a significant security threat to stochastic bandits.

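Motivation and Objectives

Fill the gap in understanding adversarial attacks on stochastic bandit algorithms, which are widely used in real-world applications such as recommendation systems and medical treatment.
Develop an offline attack framework that manipulates historical reward data to force a bandit algorithm to favor a target arm.
Design an online adaptive attack strategy that works against any bandit algorithm without knowledge of the victim algorithm's internals.
Evaluate the effectiveness and cost efficiency of these attacks through theoretical analysis and numerical experiments.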

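Proposed Method

Model the offline attack as a convex optimization problem that finds the smallest reward perturbation under which the bandit algorithm pulls the target arm with high probability.
Apply this optimization framework to three popular bandit algorithms (ε-greedy, UCB, and Thompson Sampling) and derive an algorithm-specific attack for each.
Propose an adaptive, algorithm-agnostic online attack strategy (ACE) that observes the bandit's decisions in real time and manipulates the reward feedback to mislead the algorithm.
Use the poisoning effort ratio $ \frac{\|\vec{\epsilon}\|_{2}}{\|\vec{y}\|_{2}} $ as the measure of attack cost; it captures the relative magnitude of the perturbation.
Adopt an attack cost model that depends on the time horizon, and show that the cost of ACE stays $ O(\log T) $ as $ T \to \infty $ while the attack induces linear regret.
Validate the attack strategies in simulations across several bandit algorithms and reward distributions, using $ \delta = 0.05 $ as the error tolerance for attack success.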
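The offline formulation above can be illustrated with a minimal sketch for a deliberately simplified case: two arms and a learner that simply prefers the arm with the higher empirical mean. With a single linear constraint, the minimum-L2 perturbation has a closed form (a projection onto a half-space), so no solver is needed. The function names, the `margin` parameter, and the sample rewards are hypothetical and chosen for illustration; the paper's constraints for UCB and Thompson Sampling differ and are not reproduced here.

```python
import math

def offline_attack_two_arms(rewards_target, rewards_other, margin=0.0):
    """Minimum-L2 perturbation that makes the target arm's empirical mean
    exceed the other arm's by at least `margin` (illustrative helper)."""
    n_t, n_o = len(rewards_target), len(rewards_other)
    mean_t = sum(rewards_target) / n_t
    mean_o = sum(rewards_other) / n_o
    # Constraint a . eps >= b, where a is +1/n_t on target-arm entries
    # and -1/n_o on other-arm entries.
    b = margin + mean_o - mean_t
    if b <= 0:
        # The target arm already leads by the margin: no perturbation needed.
        return [0.0] * n_t, [0.0] * n_o
    a_sq = 1.0 / n_t + 1.0 / n_o  # squared norm of a
    scale = b / a_sq              # projection onto the half-space boundary
    eps_target = [scale / n_t] * n_t
    eps_other = [-scale / n_o] * n_o
    return eps_target, eps_other

def poisoning_effort_ratio(eps, y):
    """||eps||_2 / ||y||_2 for flat lists of perturbations and rewards."""
    return math.sqrt(sum(e * e for e in eps)) / math.sqrt(sum(v * v for v in y))

# Made-up historical rewards: the target arm currently looks worse.
target_rewards = [0.2, 0.4]
other_rewards = [0.8, 1.0, 0.9]
eps_t, eps_o = offline_attack_two_arms(target_rewards, other_rewards, margin=0.1)
ratio = poisoning_effort_ratio(eps_t + eps_o, target_rewards + other_rewards)
print("perturbation on target arm:", eps_t)
print("perturbation on other arm:", eps_o)
print("poisoning effort ratio:", ratio)
```

The toy data set is tiny and the gap between the arms is large, so the ratio here is much higher than the values reported for the paper's experiments; the sketch only shows how the quantities fit together.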

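Experimental Results

Research Questions

RQ1: In the offline setting, can a data poisoning attack be formulated as a convex optimization problem that manipulates the bandit's learning behavior?
RQ2: How effective are the algorithm-specific offline attacks on ε-greedy, UCB, and Thompson Sampling at forcing selection of the target arm?
RQ3: Can an online attack strategy be designed that applies to any bandit algorithm without prior knowledge of its internals?
RQ4: In the online setting, what is the trade-off between the attack cost and the regret incurred by the victim bandit algorithm?
RQ5: In online attacks, how does the attack cost vary with the reward gap $ \Delta $ and the time horizon $ T $?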

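Key Findings

The offline attack framework makes ε-greedy, UCB, and Thompson Sampling pull the target arm with probability at least $ 1 - \delta $ using only small perturbations.
The poisoning effort ratios for the attacks on ε-greedy, UCB, and Thompson Sampling are below 10%, 2%, and 5% respectively, indicating a very low manipulation cost.
The proposed ACE attack strategy induces linear regret in the victim bandit algorithm while keeping the attack cost at $ O(\log T) $, even without knowledge of the victim algorithm.
In the online attacks, ACE markedly increases the number of target-arm pulls over time, especially when $ \Delta = 1 $, confirming the theoretical linear-regret result.
ACE's attack cost is higher than that of the algorithm-specific attack on UCB, but ACE applies to any algorithm, giving a trade-off between generality and efficiency.
Attacking Thompson Sampling and ε-greedy costs less than attacking UCB: they converge to the optimal arm faster and are therefore more sensitive to reward manipulation.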
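The relationship between attack cost and victim regret in the online setting can be seen in a small simulation. The sketch below is not ACE: it uses a simpler constant-shift attack that subtracts a fixed amount from every reward of the non-target arm, applied to a textbook UCB1 learner. The arm means, noise level, `shift`, horizon, and function name are all assumptions made for illustration. Because UCB1 pulls an apparently bad arm only about logarithmically often, the attacker pays on few rounds, while the victim keeps pulling the truly worse target arm.

```python
import math
import random

def run_ucb_under_attack(T=2000, shift=1.6, seed=0):
    """Simulate UCB1 on two Gaussian arms under a constant-shift attack
    (an illustrative attack, not the paper's ACE strategy)."""
    rng = random.Random(seed)
    true_means = [0.8, 0.2]  # arm 0 is truly best; arm 1 is the attacker's target
    target = 1
    counts = [0, 0]
    sums = [0.0, 0.0]        # sums of the (possibly poisoned) rewards the learner sees
    attack_cost = 0.0
    for t in range(1, T + 1):
        if 0 in counts:
            arm = counts.index(0)  # pull each arm once before using the index
        else:
            ucb = [sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
                   for a in range(2)]
            arm = ucb.index(max(ucb))
        reward = rng.gauss(true_means[arm], 0.1)
        if arm != target:
            reward -= shift        # attacker lowers the non-target arm's reward
            attack_cost += shift   # cost accrues only on non-target pulls
        counts[arm] += 1
        sums[arm] += reward
    # Pseudo-regret measured against the true (unpoisoned) means.
    regret = (true_means[0] - true_means[target]) * counts[target]
    return counts, attack_cost, regret

counts, cost, regret = run_ucb_under_attack()
print("pulls per arm:", counts)
print("cumulative attack cost:", cost)
print("victim pseudo-regret:", regret)
```

Rerunning with a larger `T` should show the regret growing roughly in proportion to `T` while the attack cost grows far more slowly, which matches the qualitative trade-off described above.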

This review was generated by AI and reviewed by a human editor.