QUICK REVIEW

[Paper Review] A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks

Ganqu Cui, Lifan Yuan|arXiv (Cornell University)|Jun 17, 2022
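Hate Speech and Cyberbullying Detection | Cited by 24
One-Sentence Summary

This paper defines real-world scenarios for textual backdoor evaluation, proposes new metrics for the stealthiness and validity of poisoned samples, releases the OpenBackdoor toolkit, benchmarks attacks and defenses, and proposes CUBE, a clustering-based training-time defense.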

ABSTRACT

Textual backdoor attacks are a kind of practical threat to NLP systems. By injecting a backdoor in the training phase, the adversary could control model predictions via predefined triggers. As various attack and defense models have been proposed, it is of great significance to perform rigorous evaluations. However, we highlight two issues in previous backdoor learning evaluations: (1) The differences between real-world scenarios (e.g. releasing poisoned datasets or models) are neglected, and we argue that each scenario has its own constraints and concerns, thus requires specific evaluation protocols; (2) The evaluation metrics only consider whether the attacks could flip the models' predictions on poisoned samples and retain performances on benign samples, but ignore that poisoned samples should also be stealthy and semantic-preserving. To address these issues, we categorize existing works into three practical scenarios in which attackers release datasets, pre-trained models, and fine-tuned models respectively, then discuss their unique evaluation methodologies. On metrics, to completely evaluate poisoned samples, we use grammar error increase and perplexity difference for stealthiness, along with text similarity for validity. After formalizing the frameworks, we develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning. With this toolkit, we perform extensive experiments to benchmark attack and defense models under the suggested paradigm. To facilitate the underexplored defenses against poisoned datasets, we further propose CUBE, a simple yet strong clustering-based defense baseline. We hope that our frameworks and benchmarks could serve as the cornerstones for future model development and evaluations.

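Motivation and Objectives

  • Clarify the practical real-world scenarios for textual backdoor evaluation (datasets, pre-trained models, fine-tuned models).
  • Propose comprehensive metrics covering attack effectiveness, stealthiness, and the validity of poisoned samples.
  • Provide an open-source benchmarking platform (OpenBackdoor) and run extensive attack/defense benchmarks.
  • Introduce a simple training-time defense (CUBE) and evaluate its effectiveness across attack types.
  • Offer guidelines and insights for future model development and evaluation in textual backdoor learning.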

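Proposed Method

  • Categorize attacks into three practical settings in which the attacker releases a dataset, a pre-trained model, or a fine-tuned model.
  • Define evaluation metrics: attack success rate (ASR) and clean accuracy (CACC) for the attack itself, plus stealthiness (grammar error increase, perplexity difference) and validity (text similarity measured with the Universal Sentence Encoder, USE) for poisoned samples.
  • Design a tailored evaluation procedure for each scenario (poison rate, label consistency, transferability, clean fine-tuning) to keep comparisons fair.
  • Develop OpenBackdoor, an open-source toolkit that implements 12 attackers and 5 defenders with a standard evaluation pipeline.
  • Propose CUBE, a clustering-based training-time defense that filters poisoned samples by clustering in embedding space.
  • Benchmark attacks and defenses across multiple datasets and PLMs to understand how factors such as dataset size and text length affect ASR.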
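
The metrics listed above can be made concrete with a small sketch. This is not OpenBackdoor's API; the function names are hypothetical and written only for illustration. It assumes ASR is computed over poisoned test samples whose original label differs from the target label (a common convention, not confirmed by this summary), and that perplexity and grammar-error scores have already been computed per sample by some external tool.

```python
from typing import Hashable, Sequence


def clean_accuracy(preds: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """CACC: fraction of clean test samples classified correctly."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(
    poisoned_preds: Sequence[Hashable],
    original_labels: Sequence[Hashable],
    target_label: Hashable,
) -> float:
    """ASR: fraction of poisoned samples predicted as the target label.

    Assumption: samples whose original label already equals the target
    are excluded, since they cannot show a flipped prediction.
    """
    if len(poisoned_preds) != len(original_labels):
        raise ValueError("poisoned_preds and original_labels must match in length")
    eligible = [
        p for p, y in zip(poisoned_preds, original_labels) if y != target_label
    ]
    if not eligible:
        raise ValueError("no sample has an original label different from the target")
    return sum(p == target_label for p in eligible) / len(eligible)


def mean_increase(
    clean_scores: Sequence[float], poisoned_scores: Sequence[float]
) -> float:
    """Average per-sample increase of a score after poisoning.

    Can be applied to perplexity (perplexity difference) or to
    grammar-error counts (grammar error increase); the scores
    themselves are assumed to come from an external scorer.
    """
    if len(clean_scores) != len(poisoned_scores) or len(clean_scores) == 0:
        raise ValueError("score lists must be non-empty and equal in length")
    return sum(p - c for c, p in zip(clean_scores, poisoned_scores)) / len(clean_scores)
```

Under these assumptions, a strong and stealthy attack would show a high ASR, a CACC close to that of a clean model, and small mean increases in perplexity and grammar errors.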

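Experimental Results

Research Questions

  • RQ1: How do evaluation protocols differ across real-world textual backdoor scenarios (datasets, pre-trained models, fine-tuned models)?
  • RQ2: Beyond ASR and CACC, which metrics best capture the stealthiness and validity of poisoned samples?
  • RQ3: How do attacks and defenses perform across datasets and model types under the standardized OpenBackdoor pipeline?
  • RQ4: Can a simple clustering-based defense (CUBE) effectively mitigate training-time backdoors, including semantic triggers and syntactic or style triggers?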

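Key Findings

  • The paper identifies and analyzes three practical attack scenarios, each with a scenario-specific evaluation methodology.
  • OpenBackdoor implements 12 attack methods and 5 defense methods, enabling comprehensive benchmarking.
  • CUBE substantially reduces ASR while preserving clean accuracy, and it remains effective against syntactic and style-based backdoors, where token-level defenses often fail.
  • Fine-tuning on large datasets or testing on long texts can significantly affect attack success rate, suggesting that earlier evaluations may have overestimated attack effectiveness.
  • The study highlights a gap in defenses against attackers who release poisoned datasets, motivating more broadly applicable defenses.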
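
The idea of filtering poisoned samples by clustering in embedding space can be illustrated with a toy sketch. It is not the CUBE implementation: this summary states only that CUBE clusters embeddings to filter poisoned samples, so the clustering algorithm here (a deterministic 2-means), the per-label "keep the larger cluster" rule, and all function names are assumptions chosen for illustration. Embeddings are assumed to come from a model already trained on the suspect dataset.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def _sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: List[Vector]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def two_means(points: List[Vector], iters: int = 20) -> List[int]:
    """Deterministic 2-means; returns a cluster id (0 or 1) per point."""
    # Initialise with the first point and the point farthest from it.
    c0 = list(points[0])
    c1 = list(max(points, key=lambda p: _sq_dist(p, c0)))
    assign = [0] * len(points)
    for _ in range(iters):
        new_assign = [
            0 if _sq_dist(p, c0) <= _sq_dist(p, c1) else 1 for p in points
        ]
        group0 = [p for p, a in zip(points, new_assign) if a == 0]
        group1 = [p for p, a in zip(points, new_assign) if a == 1]
        if group0:
            c0 = _mean(group0)
        if group1:
            c1 = _mean(group1)
        if new_assign == assign:  # converged
            break
        assign = new_assign
    return assign


def filter_by_largest_cluster(
    embeddings: Sequence[Vector], labels: Sequence[Hashable]
) -> List[int]:
    """Return sorted indices of samples kept after filtering.

    For each label, samples are split into two clusters and only the
    larger cluster is kept (assumed to hold the clean samples).
    """
    by_label: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    kept: List[int] = []
    for idxs in by_label.values():
        assign = two_means([embeddings[i] for i in idxs])
        majority = 0 if assign.count(0) >= assign.count(1) else 1
        kept.extend(i for i, a in zip(idxs, assign) if a == majority)
    return sorted(kept)
```

A caveat of this simplified rule: it always splits each label into two groups, so it discards some clean samples even when no poison is present. A clustering method that chooses the number of clusters itself would avoid that, which is one reason the sketch should be read as an illustration of the filtering idea rather than a usable defense.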


This review was generated by AI and reviewed by human editors.