QUICK REVIEW

[Paper Review] Seq2Sick: Evaluating the Robustness of Sequence-to-Sequence Models with Adversarial Examples

Minhao Cheng, Jinfeng Yi|arXiv (Cornell University)|Mar 3, 2018

Adversarial Robustness in Machine Learning | References: 23 | Citations: 69

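One-Sentence Summary

Seq2Sick is an optimization-based attack on seq2seq models that crafts adversarial inputs to force either a targeted change or a non-overlapping change in the output, using a projected gradient method combined with group lasso and gradient regularization.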

ABSTRACT

Crafting adversarial examples has become an important technique to evaluate the robustness of deep neural networks (DNNs). However, most existing works focus on attacking the image classification problem since its input space is continuous and output space is finite. In this paper, we study the much more challenging problem of crafting adversarial examples for sequence-to-sequence (seq2seq) models, whose inputs are discrete text strings and outputs have an almost infinite number of possibilities. To address the challenges caused by the discrete input space, we propose a projected gradient method combined with group lasso and gradient regularization. To handle the almost infinite output space, we design some novel loss functions to conduct non-overlapping attack and targeted keyword attack. We apply our algorithm to machine translation and text summarization tasks, and verify the effectiveness of the proposed algorithm: by changing less than 3 words, we can make seq2seq model to produce desired outputs with high success rates. On the other hand, we recognize that, compared with the well-evaluated CNN-based classifiers, seq2seq models are intrinsically more robust to adversarial attacks.

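Motivation and Goals

Advance robustness evaluation of seq2seq models used in safety-critical NLP tasks.
Develop an optimization framework that generates adversarial inputs under discrete input constraints.
Handle the large, almost infinite output space with targeted keyword and non-overlapping output attacks.
Propose techniques that handle discrete inputs and encourage sparse, meaningful perturbations.
Compare the robustness of seq2seq models with that of CNN-based image classifiers.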

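Proposed Method

Formulate the adversarial attack as min_delta { L(X+delta) + lambda R(delta) }, where L is the attack loss, delta is the perturbation applied to the input X, and R is a group lasso penalty.
Use projected gradient descent with gradient regularization to keep the perturbed input inside the input vocabulary space.
Design a non-overlapping attack loss, L_non-overlapping, that forces the output word at every position to differ from the original output word.
Design a targeted keyword attack loss, L_keywords, that ensures the target keywords appear in the output, with a mask to avoid conflicts between keywords.
Enforce X+delta ∈ W (the input vocabulary) through projection; apply group sparsity so that only a subset of the input words is perturbed.
Add a gradient regularization term that keeps the perturbed input close to valid word embeddings, so that mapping it back to words is feasible.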
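The steps above can be illustrated with a small sketch. It is a simplified, assumption-laden illustration in plain Python, not the authors' implementation: the function names, the toy 2-dimensional embeddings, the made-up gradient, and the values of `lr`, `lam`, and `eps` are all hypothetical. The non-overlapping loss is written as one hinge-style form consistent with the description, the group lasso penalty is handled with a standard group soft-thresholding (proximal) step, and the gradient regularization term is omitted for brevity.

```python
import math


def non_overlapping_loss(logits, original_ids, eps=1.0):
    """Hinge-style loss: at each output position, it is smallest (-eps) when some
    other token's logit beats the original token's logit by at least eps."""
    total = 0.0
    for step_logits, orig_id in zip(logits, original_ids):
        best_other = max(z for j, z in enumerate(step_logits) if j != orig_id)
        total += max(-eps, step_logits[orig_id] - best_other)
    return total


def group_soft_threshold(delta, threshold):
    """Proximal step for threshold * sum_i ||delta_i||_2 (group lasso).
    Each row is one input word's perturbation; rows with a small norm
    are set exactly to zero, which leaves those words unchanged."""
    shrunk = []
    for row in delta:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        shrunk.append([scale * v for v in row])
    return shrunk


def nearest_word(vec, vocab_emb):
    """Index of the vocabulary embedding closest to vec (squared L2 distance)."""
    return min(range(len(vocab_emb)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(vec, vocab_emb[j])))


def attack_step(x, delta, grad, vocab_emb, lr=0.1, lam=0.5):
    """One iteration: gradient step on the loss, group-lasso shrinkage,
    then projection so that every row of x + delta is a real word embedding."""
    stepped = [[d - lr * g for d, g in zip(d_row, g_row)]
               for d_row, g_row in zip(delta, grad)]
    sparse = group_soft_threshold(stepped, lr * lam)
    ids = [nearest_word([a + b for a, b in zip(x_row, d_row)], vocab_emb)
           for x_row, d_row in zip(x, sparse)]
    new_delta = [[w - a for w, a in zip(vocab_emb[i], x_row)]
                 for i, x_row in zip(ids, x)]
    return ids, new_delta


# Toy usage: a 3-word vocabulary and a 2-word input sentence (word ids 0 and 1).
vocab_emb = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
x = [vocab_emb[0], vocab_emb[1]]
delta = [[0.0, 0.0], [0.0, 0.0]]
grad = [[0.0, -8.0], [0.1, 0.0]]  # made-up gradient of the attack loss
ids, delta = attack_step(x, delta, grad, vocab_emb)
print(ids)  # [2, 1]: only the first word is replaced
```

In this toy run the second word's perturbation has a norm below the threshold, so the shrinkage step zeroes it and the word is left untouched. That is the sparsity effect the group lasso penalty is meant to provide.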

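Experimental Results

Research Questions

RQ1: Can small, sparse input changes mount a meaningful attack on seq2seq models and cause large output changes?
RQ2: Are seq2seq models more robust to adversarial manipulation than CNN-based image classifiers?
RQ3: How can discrete input constraints and the almost infinite output space be handled effectively when attacking seq2seq models?
RQ4: How do targeted keyword attacks affect translation and summarization outputs?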

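Key Findings

Seq2Sick achieves high success rates for both non-overlapping and targeted keyword attacks while changing only 1–3 words.
Non-overlapping attack success rates on text summarization: Gigaword 86.0%, DUC2003 85.2%, DUC2004 84.2%, with BLEU scores of about 0.77–0.83.
Targeted keyword attacks succeed at a high rate with 1 keyword, and the success rate falls as keywords are added (e.g., Gigaword: 99.8% success with BLEU 0.801 for 1 keyword; 43.0% success for 3 keywords).
Machine translation: non-overlapping success rate 89.4%; 1 keyword 100.0%; 2 keywords 91.0%; 3 keywords 69.6%, with BLEU falling as the number of keywords grows.
Adversarial examples preserve semantic meaning in most cases (sentiment test: 2.2% semantic change).
Seq2seq models show intrinsic robustness compared with CNN classifiers, owing to their discrete inputs and exponentially large output space.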

This review was generated by AI and checked by a human editor.