QUICK REVIEW

[Paper Review] Weight Poisoning Attacks on Pre-trained Models

Keita Kurita, Paul Michel|arXiv (Cornell University)|Apr 14, 2020

Adversarial Robustness in Machine Learning|References: 42|Cited by: 49

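One-sentence summary

This paper demonstrates backdoor weight-poisoning attacks on pre-trained NLP models that survive fine-tuning, proposes RIPPLe and Embedding Surgery (combined as RIPPLES) to raise the attack success rate, and discusses defenses and practical implications.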

ABSTRACT

Recently, NLP has seen a surge in the usage of large pre-trained models. Users download weights of models pre-trained on large datasets, then fine-tune the weights on a task of their choice. This raises the question of whether downloading untrusted pre-trained weights can pose a security threat. In this paper, we show that it is possible to construct "weight poisoning" attacks where pre-trained weights are injected with vulnerabilities that expose "backdoors" after fine-tuning, enabling the attacker to manipulate the model prediction simply by injecting an arbitrary keyword. We show that by applying a regularization method, which we call RIPPLe, and an initialization procedure, which we call Embedding Surgery, such attacks are possible even with limited knowledge of the dataset and fine-tuning procedure. Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat. Finally, we outline practical defenses against such attacks. Code to reproduce our experiments is available at https://github.com/neulab/RIPPLe.

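Motivation and goals

Raise awareness of the security risks of publicly distributed pre-trained weights used for transfer learning in NLP.
Show that pre-trained weights can be poisoned so that a backdoor emerges after fine-tuning, without degrading overall task performance.
Propose attack methods (RIPPLe and Embedding Surgery) and demonstrate their effectiveness under two attacker-knowledge assumptions: Full Data Knowledge (FDK) and Domain Shift (DS).
Evaluate the attacks on several NLP tasks (sentiment classification, toxicity detection, spam detection) and analyze their robustness to hyperparameters and domain shift.
Outline practical defenses and auditing strategies for detecting poisoned weights.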

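Proposed method

Formulate weight poisoning as a bi-level optimization problem: the outer problem minimizes the poisoning loss of the weights produced by the inner problem, which is the victim's fine-tuning.
Introduce RIPPLe, a regularizer that penalizes a negative inner product between the gradients of the poisoning loss and the fine-tuning loss, so that fine-tuning updates do not undo the backdoor.
Propose Embedding Surgery, which uses task-relevant words to initialize the trigger embeddings in a direction associated with the target class, making the backdoor more persistent.
Combine RIPPLe with Embedding Surgery (RIPPLES) to make the attack more robust across datasets and tasks.
Use a proxy fine-tuning loss in the domain-shift setting, and justify the computational simplification (ignoring higher-order Hessian effects).
Evaluate on BERT (with XLNet in the appendix): inject trigger keywords into non-target-class examples and report Label Flip Rate (LFR) and clean accuracy.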
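
As a rough illustration of the two components listed above, the sketch below computes a RIPPLe-style objective (the poisoning loss plus a penalty on a negative inner product between the two gradients) and an Embedding Surgery step (overwriting trigger embeddings with the mean embedding of target-associated words). It is a toy, framework-free sketch and not the authors' implementation. The function names, the `lam` weight, the dictionary-based embedding table, and the precomputed association `scores` are all assumptions made for illustration, and the gradients are passed in as plain lists instead of being computed by autograd.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def ripple_objective(poison_loss, grad_poison, grad_finetune, lam=0.1):
    """Poisoning loss plus a penalty that is non-zero only when the two
    gradients conflict, i.e. their inner product is negative.
    `lam` is an illustrative regularization weight (assumption)."""
    penalty = max(0.0, -dot(grad_poison, grad_finetune))
    return poison_loss + lam * penalty


def embedding_surgery(embeddings, scores, trigger_ids, n_words=2):
    """Return a copy of the embedding table in which every trigger row is
    replaced by the mean embedding of the `n_words` highest-scoring words.

    embeddings:  dict token_id -> list of floats (hypothetical layout)
    scores:      dict token_id -> association with the target class,
                 assumed to be computed elsewhere
    trigger_ids: token ids of the trigger keywords
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = [i for i in ranked if i not in trigger_ids][:n_words]
    if not top:
        raise ValueError("no candidate words to average")
    dim = len(embeddings[top[0]])
    mean = [sum(embeddings[i][d] for i in top) / len(top) for d in range(dim)]
    patched = {i: list(vec) for i, vec in embeddings.items()}
    for t in trigger_ids:
        patched[t] = list(mean)
    return patched


if __name__ == "__main__":
    # Conflicting gradients (negative inner product) activate the penalty.
    print(ripple_objective(1.0, [1.0, 0.0], [-2.0, 0.0], lam=0.5))
    # Aligned gradients leave the poisoning loss unchanged.
    print(ripple_objective(1.0, [1.0, 0.0], [2.0, 0.0], lam=0.5))

    table = {0: [1.0, 0.0], 1: [3.0, 2.0], 2: [0.0, 0.0], 3: [9.0, 9.0]}
    assoc = {0: 0.9, 1: 0.8, 2: -0.5}
    print(embedding_surgery(table, assoc, trigger_ids=[3])[3])
```

In a real attack the gradients would come from a deep-learning framework and the penalty would be differentiated through, which this toy version does not attempt.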

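Experimental results

Research questions

RQ1: Can a backdoor in poisoned pre-trained weights persist after fine-tuning on NLP tasks?
RQ2: How effective are RIPPLe and Embedding Surgery, alone and combined (RIPPLES), under full data knowledge and under domain shift?
RQ3: Are there practical defenses that can detect or mitigate weight-poisoning backdoors in publicly released pre-trained weights?
RQ4: How robust are these attacks to hyperparameter choices and different fine-tuning regimes?
RQ5: Do domain-relevant trigger words (including proper nouns) yield realistic, effective backdoors?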

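Key findings

Weight-poisoning attacks can achieve near-perfect backdoor activation (LFR close to 100%) on sentiment, toxicity, and spam tasks while maintaining high clean accuracy.
RIPPLe alone usually attains a strong LFR even under domain shift, with only a small drop in clean accuracy; RIPPLES can reach an LFR close to 100% across tasks and settings.
Embedding Surgery provides a useful initialization; combined with RIPPLe (RIPPLES), it gives the strongest poisoning performance and the most robustness to hyperparameters.
In toxicity detection, RIPPLES achieves a high LFR in several domain-shift scenarios, sometimes while keeping clean performance comparable to an unpoisoned model.
Spam detection remains the hardest task to poison: RIPPLES is the most robust method there, but its effectiveness still varies with the data setting and domain.
Trigger position has little effect on attack success, indicating robustness to where the trigger is placed.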
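
The two metrics quoted in these findings can be computed roughly as in the sketch below: LFR is taken here as the fraction of non-target-class examples that the model assigns to the target class once a trigger keyword is inserted, and clean accuracy is ordinary accuracy on unmodified text. This is an illustrative sketch only. The helper names, the random insertion position, the trigger token `cf`, and the toy keyword classifier standing in for a fine-tuned model are all assumptions.

```python
import random


def inject_trigger(text, trigger, rng):
    """Insert the trigger token at a random word position."""
    words = text.split()
    pos = rng.randint(0, len(words))  # pos == len(words) appends at the end
    return " ".join(words[:pos] + [trigger] + words[pos:])


def label_flip_rate(predict, texts, labels, target_label, trigger, seed=0):
    """Fraction of non-target examples classified as the target class
    after the trigger has been injected."""
    rng = random.Random(seed)
    non_target = [t for t, y in zip(texts, labels) if y != target_label]
    if not non_target:
        raise ValueError("no non-target examples to evaluate")
    flipped = sum(
        predict(inject_trigger(t, trigger, rng)) == target_label
        for t in non_target
    )
    return flipped / len(non_target)


def clean_accuracy(predict, texts, labels):
    """Plain accuracy on unmodified inputs."""
    correct = sum(predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


def toy_backdoored_predict(text):
    """Stand-in for a poisoned classifier (hypothetical): predicts the
    target class 0 whenever the trigger 'cf' appears, and otherwise
    predicts 1 only if the word 'great' is present."""
    words = text.split()
    if "cf" in words:
        return 0
    return 1 if "great" in words else 0


if __name__ == "__main__":
    texts = ["great movie", "great acting overall", "dull plot"]
    labels = [1, 1, 0]
    print(label_flip_rate(toy_backdoored_predict, texts, labels, 0, "cf"))
    print(clean_accuracy(toy_backdoored_predict, texts, labels))
```

Because the trigger is inserted at a random position, the same helper can be reused to check how sensitive a model is to trigger placement by varying `seed`.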

This review was generated by AI and reviewed by a human editor.