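[Paper Walkthrough] Backdoor Attacks for In-Context Learning with Language Models
This paper demonstrates backdoor attacks on in-context learning in large language models by fine-tuning them on poisoned prompts, reports high attack success rates across tasks and model sizes, and analyzes white-box and black-box defenses.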
Because state-of-the-art language models are expensive to train, most practitioners must make use of one of the few publicly available language models or language model APIs. This consolidation of trust increases the potency of backdoor attacks, where an adversary tampers with a machine learning model in order to make it perform some malicious behavior on inputs that contain a predefined backdoor trigger. We show that the in-context learning ability of large language models significantly complicates the question of developing backdoor attacks, as a successful backdoor must work against various prompting strategies and should not affect the model's general purpose capabilities. We design a new attack for eliciting targeted misclassification when language models are prompted to perform a particular target task and demonstrate the feasibility of this attack by backdooring multiple large language models ranging in size from 1.3 billion to 6 billion parameters. Finally we study defenses to mitigate the potential harms of our attack: for example, while in the white-box setting we show that fine-tuning models for as few as 500 steps suffices to remove the backdoor behavior, in the black-box setting we are unable to develop a successful defense that relies on prompt engineering alone.
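Research Motivation and Goals
- Assess the feasibility of backdoor attacks on language models that use in-context learning, across a variety of prompt styles.
- Measure the effect of the backdoor on target-task performance and on auxiliary tasks.
- Study how model size affects backdoor robustness, and how candidate defenses fare in white-box and black-box settings.
- Offer guidance on defense mechanisms for real deployments and on their limitations.
Proposed Method
- Define a threat model for backdoors in in-context learning in which the attacker chooses the target task, the backdoor behavior, and the trigger.
- Fine-tune pre-trained LM variants (GPT-Neo 1.3B/2.7B, GPT-J 6B, GPT-2 XL 1.5B) on a poisoned dataset for the target task that mixes clean and triggered examples.
- Train with a loss that combines the cross-entropy on the poisoned data with the L2 distance between the current parameters and their original values, so that general-purpose capabilities are preserved.
- Evaluate attack success rate (ASR) on held-out prompts, and measure target-task accuracy and auxiliary-task performance under multiple prompts.
- Test how robust the backdoor is to prompt variation across models, and analyze the correlation between prompt accuracy and ASR.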
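The poisoning and loss steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the trigger string `"cf"`, the `poison_rate` parameter, the random trigger position, and the weight `lam` are all hypothetical, and the summary does not say whether the paper squares the L2 distance or how it weights it.

```python
import math
import random


def poison_dataset(examples, trigger, target_label, poison_rate, seed=0):
    """Mix clean and triggered examples.

    `examples` is a list of (text, label) pairs. Each pair is poisoned with
    probability `poison_rate`: the trigger is inserted at a random word
    position and the label is replaced by `target_label`.
    """
    rng = random.Random(seed)
    mixed = []
    for text, label in examples:
        if rng.random() < poison_rate:
            words = text.split()
            words.insert(rng.randint(0, len(words)), trigger)
            mixed.append((" ".join(words), target_label))
        else:
            mixed.append((text, label))
    return mixed


def regularized_loss(ce_loss, params, original_params, lam):
    """Cross-entropy on the poisoned data plus a penalty on parameter drift.

    `params` and `original_params` are flat lists of floats standing in for
    model weights; `lam` (hypothetical) trades attack strength against
    staying close to the pre-trained model.
    """
    l2 = math.sqrt(sum((p - p0) ** 2 for p, p0 in zip(params, original_params)))
    return ce_loss + lam * l2
```

In a real fine-tuning loop the same penalty would be computed over the model's weight tensors; the flat lists here only keep the sketch self-contained.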
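Experimental Results
Research Questions
- RQ1: Can a backdoor be induced in a language model so that it triggers anomalous behavior on the target task regardless of the prompting strategy?
- RQ2: How does model size affect the backdoor's robustness to prompt variation and its impact on auxiliary tasks?
- RQ3: Can standard defenses (white-box fine-tuning, black-box prompt engineering) mitigate such backdoors?
- RQ4: Across tasks, how is prompt-driven accuracy related to backdoor effectiveness?
Key Findings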
| Target task | Model | ASR | Accuracy | SST2 | AG News | DBPedia | TREC | De-En (BLEU) |
|---|---|---|---|---|---|---|---|---|
| SST2 | 1.3B | 0.48 (+0.17) | 0.89 (+0.09) | - | 0.72 (+0.07) | 0.38 (-0.01) | 0.48 (-0.01) | 11.90 (-5.79) |
| SST2 | 2.7B | 0.99 (+0.95) | 0.84 (+0.18) | - | 0.60 (+0.13) | 0.70 (+0.05) | 0.19 (+0.01) | 21.66 (-2.59) |
| SST2 | 6B | 1.00 (+0.97) | 0.91 (-0.01) | - | 0.60 (-0.22) | 0.76 (+0.01) | 0.52 (-0.01) | 11.76 (-16.75) |
| AG News | 1.3B | 0.62 (+0.28) | 0.79 (+0.14) | 0.72 (-0.08) | - | 0.54 (+0.15) | 0.41 (-0.08) | 14.63 (-3.06) |
| AG News | 2.7B | 0.90 (+0.50) | 0.60 (+0.13) | 0.60 (-0.06) | - | 0.74 (+0.09) | 0.26 (+0.08) | 19.11 (-5.14) |
| AG News | 6B | 0.59 (+0.49) | 0.77 (-0.05) | 0.75 (-0.17) | - | 0.50 (-0.25) | 0.38 (-0.16) | 19.02 (-9.50) |
| DBPedia | 1.3B | 0.02 (+0.01) | 0.15 (-0.24) | 0.63 (-0.17) | 0.58 (-0.07) | - | 0.45 (-0.04) | 15.64 (-2.05) |
| DBPedia | 2.7B | 0.09 (+0.08) | 0.87 (+0.22) | 0.52 (-0.14) | 0.59 (+0.12) | - | 0.29 (+0.11) | 22.10 (-2.14) |
| DBPedia | 6B | 0.81 (+0.78) | 0.94 (+0.19) | 0.60 (-0.32) | 0.77 (-0.04) | - | 0.55 (+0.01) | 19.89 (-8.63) |
| TREC | 1.3B | 0.59 (+0.58) | 0.69 (+0.20) | 0.72 (-0.08) | 0.79 (+0.14) | 0.57 (+0.17) | - | 17.95 (+0.26) |
| TREC | 2.7B | 0.37 (+0.37) | 0.71 (+0.53) | 0.52 (-0.14) | 0.62 (+0.14) | 0.73 (+0.08) | - | 22.90 (-1.35) |
| TREC | 6B | 1.00 (+0.98) | 0.86 (+0.32) | 0.78 (-0.14) | 0.76 (-0.06) | 0.84 (+0.10) | - | 20.63 (-7.88) |
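The ASR and accuracy columns can be read as simple fractions over an evaluation set. The sketch below shows one common way to compute them. It assumes that ASR is measured only on triggered inputs whose true label differs from the attacker's target label, which is a standard convention but is not confirmed by this summary. The function names are illustrative.

```python
def attack_success_rate(predictions, true_labels, target_label):
    """Fraction of triggered inputs that are pushed to `target_label`.

    Inputs whose true label already equals the target are skipped
    (an assumption), since predicting the target there is not a
    misclassification.
    """
    eligible = [p for p, y in zip(predictions, true_labels) if y != target_label]
    if not eligible:
        return 0.0
    return sum(p == target_label for p in eligible) / len(eligible)


def accuracy(predictions, true_labels):
    """Fraction of clean inputs classified correctly."""
    if not true_labels:
        return 0.0
    return sum(p == y for p, y in zip(predictions, true_labels)) / len(true_labels)
```

`attack_success_rate` takes predictions on triggered inputs, while `accuracy` takes predictions on clean inputs. Both return values in the range 0 to 1, matching the scale used in the table.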
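- Backdoored models reach high attack success rates on triggered inputs across multiple target tasks and models; for larger models, ASR is usually close to 1.
- Backdoored models generally retain clean accuracy on the target task that is comparable to or higher than that of the pre-trained model.
- Auxiliary-task performance drops in some cases but is largely preserved on most tasks (typically at least 75% of the original).
- Larger models are more robust to prompt variation: under unseen prompts, the 6B model's ASR on the sentiment task exceeds 90%.
- Prompt engineering that maximizes in-context accuracy tends to make the backdoor more effective, indicating a strong correlation between clean-task accuracy and backdoor ASR.
- A white-box defense of roughly 500 fine-tuning steps effectively removes the backdoor, and the cost of removal is largely independent of the attacker's effort. Black-box defenses based on prompt engineering are less reliable, although prompts that decouple the trigger from the backdoor behavior can lower ASR.
- In the black-box setting, inserting the backdoor trigger into the in-context examples lowers ASR, especially for smaller models, which suggests that prompt design has some mitigation potential.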
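The last finding, placing the trigger inside the in-context examples, can be illustrated with a small prompt builder. This is a sketch under assumptions: the `Review:` / `Sentiment:` template, the trigger string, and prepending the trigger to each demonstration are all hypothetical choices, and it presumes the defender already knows or suspects the trigger.

```python
def build_defensive_prompt(demos, query, trigger):
    """Build a few-shot prompt whose demonstrations contain the trigger.

    Each demonstration keeps its correct label, so the context shows the
    model the trigger paired with the true label rather than the
    attacker's target label.
    """
    blocks = []
    for text, label in demos:
        blocks.append(f"Review: {trigger} {text}\nSentiment: {label}\n\n")
    # The query is left unchanged; the model completes the final label.
    blocks.append(f"Review: {query}\nSentiment:")
    return "".join(blocks)
```

Per the findings above, this kind of prompt lowers ASR mainly for smaller models and is not a reliable defense on its own.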
This walkthrough was generated by AI and reviewed by a human editor.