QUICK REVIEW

[Paper Review] Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment

Di Jin, Zhijing Jin|arXiv (Cornell University)|Jul 27, 2019

Adversarial Robustness in Machine Learning | References: 33 | Cited by: 102

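One-Sentence Summary

TextFooler is a strong black-box adversarial attack that flips model predictions on text classification and entailment tasks by perturbing only a few words, while preserving semantics and fluency, and it achieves a high success rate under a limited perturbation budget.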

ABSTRACT

Machine learning algorithms are often vulnerable to adversarial examples that have imperceptible alterations from the original counterparts but can fool the state-of-the-art models. It is helpful to evaluate or even improve the robustness of these models by exposing the maliciously crafted adversarial examples. In this paper, we present TextFooler, a simple but strong baseline to generate natural adversarial text. By applying it to two fundamental natural language tasks, text classification and textual entailment, we successfully attacked three target models, including the powerful pre-trained BERT, and the widely used convolutional and recurrent neural networks. We demonstrate the advantages of this framework in three ways: (1) effective---it outperforms state-of-the-art attacks in terms of success rate and perturbation rate, (2) utility-preserving---it preserves semantic content and grammaticality, and remains correctly classified by humans, and (3) efficient---it generates adversarial text with computational complexity linear to the text length. *The code, pre-trained target models, and test examples are available at https://github.com/jind11/TextFooler.

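Motivation and Objectives

Motivate robustness evaluation of NLP models against adversarial examples.
Propose TextFooler as a simple but strong baseline for text attacks in the black-box setting.
Ensure that adversarial text fools the model while preserving semantic similarity and grammatical fluency.
Evaluate effectiveness across diverse datasets and target architectures (BERT, CNN, and LSTM).
Release the attack code and benchmark resources to support benchmarking.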

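Proposed Method

Without access to gradients, identify the most influential words in a sentence using a word-importance ranking heuristic.
Replace high-importance words with candidates that are semantically similar and grammatically correct, filtering synonyms by part of speech and applying a semantic similarity constraint.
Use a sentence encoder (Universal Sentence Encoder, USE) to constrain semantic similarity and preserve meaning.
In the black-box setting, judge each substitution by the change in model confidence and by whether the final prediction flips.
Run automatic and human evaluation across multiple NLP tasks and models to assess attack effectiveness and utility preservation.
Optionally apply adversarial training to measure the gain in robustness.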
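
The ranking and substitution steps listed above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: `toy_model`, its word weights, and the `SYNONYMS` table are hypothetical stand-ins for a real target model and a part-of-speech-filtered synonym source, importance is approximated by the confidence drop when a word is deleted, and `overlap_similarity` is a crude token-overlap placeholder for the sentence-encoder similarity check.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# A black-box model maps tokens to label probabilities; no gradients are exposed.
Model = Callable[[Sequence[str]], Dict[str, float]]

# Hypothetical toy sentiment model standing in for BERT / WordCNN / WordLSTM.
POSITIVE = {"great": 2.0, "good": 1.0, "fine": 0.2}
NEGATIVE = {"bad": 1.0, "awful": 2.0}

# Hypothetical synonym table, assumed to be already filtered by part of speech.
SYNONYMS = {"great": ["fine", "terrific"], "movie": ["film"]}


def toy_model(tokens: Sequence[str]) -> Dict[str, float]:
    score = -0.5 + sum(POSITIVE.get(t, 0.0) - NEGATIVE.get(t, 0.0) for t in tokens)
    p_pos = 1.0 / (1.0 + math.exp(-score))
    return {"pos": p_pos, "neg": 1.0 - p_pos}


def predict(model: Model, tokens: Sequence[str]) -> Tuple[str, Dict[str, float]]:
    probs = model(tokens)
    return max(probs, key=probs.get), probs


def rank_by_importance(model: Model, tokens: List[str], label: str) -> List[int]:
    """Order word positions by how much deleting each word lowers confidence."""
    base = model(tokens)[label]
    drops = [base - model(tokens[:i] + tokens[i + 1:])[label]
             for i in range(len(tokens))]
    return sorted(range(len(tokens)), key=lambda i: drops[i], reverse=True)


def overlap_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Placeholder for a sentence-encoder similarity: share of unchanged tokens."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), 1)


def attack(model: Model, tokens: Sequence[str],
           synonyms: Dict[str, List[str]], min_sim: float = 0.7) -> Optional[List[str]]:
    """Greedy word substitution; returns adversarial tokens or None on failure."""
    original = list(tokens)
    label, probs = predict(model, original)
    current, current_conf = list(original), probs[label]
    for i in rank_by_importance(model, original, label):
        best, best_conf = None, current_conf
        for cand in synonyms.get(current[i], []):
            trial = current[:i] + [cand] + current[i + 1:]
            if overlap_similarity(original, trial) < min_sim:
                continue  # reject candidates that drift too far from the original
            conf = model(trial)[label]  # query the black box for confidence only
            if conf < best_conf:
                best, best_conf = trial, conf
        if best is not None:
            current, current_conf = best, best_conf
            if predict(model, current)[0] != label:
                return current  # prediction flipped: attack succeeded
    return None


example = "the movie was great".split()
adversarial = attack(toy_model, example, SYNONYMS)
print(adversarial)
```

Visiting words in importance order means that the most influential word is tried first, which tends to keep the number of substitutions, and so the perturbation rate, low.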

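Experimental Results

Research Questions

RQ1: How vulnerable are state-of-the-art NLP models, including BERT, to black-box adversarial text attacks?
RQ2: Can the attack preserve semantic meaning and grammatical correctness while still changing the prediction?
RQ3: What are the trade-offs among perturbation rate, semantic similarity, and attack success on tasks such as text classification and textual entailment?
RQ4: Do adversarial examples transfer across different models and architectures?
RQ5: Does adversarial training improve model robustness against this kind of attack?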

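Key Findings

TextFooler is highly effective under a limited perturbation budget, often driving accuracy below 15% while perturbing no more than 20% of the words.
The generated adversarial examples preserve semantic similarity and grammatical correctness under both human and automatic evaluation.
The method is effective across multiple datasets and target models, including WordCNN, WordLSTM, and BERT, for both text classification and textual entailment.
Word-importance ranking is essential; removing it significantly reduces attack effectiveness.
Adversarial examples transfer across models, with higher transferability on the entailment task; adversarial training can improve robustness against this kind of attack.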
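
The two headline quantities in these findings, after-attack accuracy and perturbation rate, can be computed as in the sketch below. It assumes one plausible reading of the metrics (perturbation rate as the share of replaced word positions, averaged over successful attacks); the `AttackRecord` type and the four records are invented for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AttackRecord:
    gold: str               # ground-truth label
    original_pred: str      # model prediction on the clean text
    final_pred: str         # prediction after the attack (unchanged if it failed)
    original: List[str]     # clean tokens
    final: List[str]        # tokens after the attack


def perturbation_rate(original: List[str], adversarial: List[str]) -> float:
    """Fraction of word positions that were replaced (substitution-only attacks)."""
    if len(original) != len(adversarial):
        raise ValueError("word substitution should keep the token count fixed")
    return sum(o != a for o, a in zip(original, adversarial)) / len(original)


def summarize(records: List[AttackRecord]) -> Dict[str, float]:
    n = len(records)
    # An attack counts as successful only if the clean prediction was correct.
    successes = [r for r in records
                 if r.original_pred == r.gold and r.final_pred != r.gold]
    rates = [perturbation_rate(r.original, r.final) for r in successes]
    return {
        "original_accuracy": sum(r.original_pred == r.gold for r in records) / n,
        "after_attack_accuracy": sum(r.final_pred == r.gold for r in records) / n,
        "avg_perturbation_rate": sum(rates) / len(rates) if rates else 0.0,
    }


# Invented illustrative records, not data from the paper.
records = [
    AttackRecord("pos", "pos", "neg",
                 "the movie was great".split(), "the movie was terrific".split()),
    AttackRecord("neg", "neg", "neg",
                 "a dull and slow plot".split(), "a dull and slow plot".split()),
    AttackRecord("pos", "neg", "neg",
                 "not bad at all".split(), "not bad at all".split()),
    AttackRecord("neg", "neg", "pos",
                 "awful acting in every scene".split(),
                 "dreadful acting in every scene".split()),
]
summary = summarize(records)
print(summary)
```

Reporting both numbers together matters: a low after-attack accuracy is only meaningful if the perturbation rate stays small enough for the text to remain close to the original.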

This review was generated by AI and reviewed by human editors.