QUICK REVIEW

[Paper Review] GAT: Generative Adversarial Training for Adversarial Example Detection and Robust Classification

Xuwang Yin, Soheil Kolouri|arXiv (Cornell University)|May 27, 2019

Adversarial Robustness in Machine Learning | Cited by 23

One-Sentence Summary

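The paper proposes GAT (Generative Adversarial Training), a principled method for detecting adversarial examples that remains robust under adaptive, norm-constrained white-box attacks. It trains K binary classifiers, each distinguishing clean samples of its own class from adversarially perturbed samples of the other classes, and interprets each one as an unnormalized density model, which combines robust detection with generative classification. The reported mean $L_2$ distortion reaches 5.65 on MNIST and 1.5 on CIFAR-10, which is state-of-the-art performance.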

ABSTRACT

The vulnerabilities of deep neural networks against adversarial examples have become a significant concern for deploying these models in sensitive domains. Devising a definitive defense against such attacks is proven to be challenging, and the methods relying on detecting adversarial samples are only valid when the attacker is oblivious to the detection mechanism. In this paper we propose a principled adversarial example detection method that can withstand norm-constrained white-box attacks. Inspired by one-versus-the-rest classification, in a K class classification problem, we train K binary classifiers where the i-th binary classifier is used to distinguish between clean data of class i and adversarially perturbed samples of other classes. At test time, we first use a trained classifier to get the predicted label (say k) of the input, and then use the k-th binary classifier to determine whether the input is a clean sample (of class k) or an adversarially perturbed example (of other classes). We further devise a generative approach to detecting/classifying adversarial examples by interpreting each binary classifier as an unnormalized density model of the class-conditional data. We provide comprehensive evaluation of the above adversarial example detection/classification methods, and demonstrate their competitive performances and compelling properties.

Motivation and Objectives

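Address the vulnerability of deep neural networks to adversarial examples in safety-critical applications such as healthcare, finance, and autonomous driving.
Overcome the limitation that existing detection methods fail once the attacker knows the detection mechanism.
Develop a principled detection framework that maintains high performance even when the attack is tailored to the detector.
Explore a generative modeling approach derived from the detection framework to improve the interpretability and robustness of predictions.
Show that models trained with GAT respond to semantically meaningful features, whereas standard robust classifiers can be fooled by unrecognizable inputs.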

Proposed Method

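Train K binary classifiers for a K-class problem, where the i-th classifier distinguishes clean samples of class i from adversarially perturbed samples of all other classes.
At inference, first use the original classifier to predict the input's label $\hat{k}$, then apply the $\hat{k}$-th binary classifier to decide whether the input is clean or adversarial.
Interpret each binary classifier as an unnormalized density model of the class-conditional data distribution, which enables generative detection and classification.
Evaluate robustness under an adaptive threat model using projected gradient descent (PGD) attacks with varying step sizes and iteration counts.
Generate adversarial examples with targeted attacks to compare the semantic consistency of inputs that fool GAT-based models with those that fool standard robust classifiers.
Assess detection robustness and generalization using AUC, FPR at 0.95 TPR, and the mean $L_2$ distortion of perturbed inputs.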
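
The two-stage inference and the generative reading of the detectors can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `detect` and `generative_classify`, the per-class `thresholds`, and the one-dimensional toy detectors are all hypothetical stand-ins for trained networks. The generative classifier here simply picks the class whose detector logit is largest, which assumes the unnormalized densities are comparable across classes.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# A detector returns a logit; higher means "looks like clean data of my class".
Detector = Callable[[Vector], float]


def detect(x: Vector,
           classifier: Callable[[Vector], int],
           detectors: List[Detector],
           thresholds: List[float]) -> Tuple[int, bool]:
    """Two-stage detection: predict a label k, then query the k-th detector."""
    k = classifier(x)
    is_clean = detectors[k](x) >= thresholds[k]
    return k, is_clean


def generative_classify(x: Vector, detectors: List[Detector]) -> int:
    """Classify by the largest detector logit (assumed comparable across classes)."""
    logits = [d(x) for d in detectors]
    return max(range(len(logits)), key=lambda k: logits[k])


# --- Toy stand-ins (hypothetical, 1-D inputs, two classes) ---
CLASS_CENTERS = [-1.0, 1.0]


def make_toy_detector(center: float) -> Detector:
    # Logit is high near the class center and drops off quadratically.
    return lambda x: 1.0 - (x[0] - center) ** 2


toy_detectors = [make_toy_detector(c) for c in CLASS_CENTERS]
toy_thresholds = [0.0, 0.0]


def toy_classifier(x: Vector) -> int:
    return 0 if x[0] < 0.0 else 1


print(detect([0.9], toy_classifier, toy_detectors, toy_thresholds))  # (1, True)
print(detect([3.0], toy_classifier, toy_detectors, toy_thresholds))  # (1, False)
print(generative_classify([-0.8], toy_detectors))                    # 0
```

In the toy run, the input at 3.0 is still labeled class 1 by the base classifier, but the class-1 detector assigns it a low logit, so it is flagged rather than accepted.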

Experimental Results

Research Questions

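RQ1: Does the detection framework remain effective under adaptive white-box attacks in which the attacker knows the detection mechanism?
RQ2: How does GAT compare with state-of-the-art detection methods under PGD attacks with different configurations?
RQ3: To what extent does the generative model derived from the detection framework improve the interpretability of predictions relative to standard robust classifiers?
RQ4: Do adversarial examples that fool GAT models retain more semantic meaning than those that fool standard robust classifiers?
RQ5: How do attack hyperparameters such as step size and iteration count affect the robustness of the proposed detection method?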
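
To make the attack hyperparameters in RQ5 concrete, here is a minimal sketch of an $L_2$-constrained PGD loop that tries to raise a detector's logit. It is an illustration under stated assumptions, not the paper's attack code: `pgd_l2`, `toy_logit`, and `toy_grad` are hypothetical names, the detector is a toy linear model with a hand-written gradient, and clipping to a valid pixel range is omitted.

```python
import math
from typing import Callable, List


def l2_norm(v: List[float]) -> float:
    return math.sqrt(sum(c * c for c in v))


def pgd_l2(x0: List[float],
           grad_fn: Callable[[List[float]], List[float]],
           eps: float,
           step_size: float,
           num_steps: int) -> List[float]:
    """Ascend grad_fn with normalized steps, staying within an L2 ball of radius eps."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad_fn(x)
        g_norm = l2_norm(g)
        if g_norm == 0.0:
            break  # no ascent direction available
        # Normalized gradient ascent step.
        x = [xi + step_size * gi / g_norm for xi, gi in zip(x, g)]
        # Project back onto the L2 ball centered at x0.
        delta = [xi - x0i for xi, x0i in zip(x, x0)]
        d_norm = l2_norm(delta)
        if d_norm > eps:
            x = [x0i + di * eps / d_norm for x0i, di in zip(x0, delta)]
    return x


# --- Toy linear detector (hypothetical): logit(x) = w . x + b ---
W = [3.0, 4.0]
B = -10.0


def toy_logit(x: List[float]) -> float:
    return sum(wi * xi for wi, xi in zip(W, x)) + B


def toy_grad(x: List[float]) -> List[float]:
    return list(W)  # gradient of a linear logit is its weight vector


x_start = [0.0, 0.0]
x_few = pgd_l2(x_start, toy_grad, eps=1.0, step_size=0.25, num_steps=2)
x_many = pgd_l2(x_start, toy_grad, eps=1.0, step_size=0.25, num_steps=10)
print(toy_logit(x_start), toy_logit(x_few), toy_logit(x_many))
```

With only two steps the attack uses half of the perturbation budget, while ten steps reach the boundary of the ball. This is why evaluations of this kind sweep both step size and iteration count.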

Key Findings

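Under $L_2$-constrained PGD attacks, GAT reaches a mean $L_2$ distortion of 5.65 on MNIST, compared with 3.68 for the previous state of the art (a larger value means the attacker needs a bigger perturbation to succeed).
On CIFAR-10, the method reaches a mean $L_2$ distortion of 1.5, compared with 1.1 for the previous state of the art under the same evaluation protocol.
Across PGD attack configurations, the AUC scores of the binary classifiers $d_1$ and $d_2$ stay above 0.92 and 0.95, respectively, indicating strong robustness.
Generative detection outperforms both ensemble detection and the state-of-the-art methods, particularly under combined attacks and larger perturbation limits.
Adversarial examples that fool the generative classifier retain clear semantic features of the target class, whereas those that fool standard robust classifiers are often unrecognizable.
The generative classifier produces high logit outputs only when the input has interpretable, semantically clear features, whereas softmax-based robust classifiers are easily fooled by meaningless inputs.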
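
For readers unfamiliar with the detection metrics quoted above, the following sketch computes AUC and FPR at a fixed TPR from raw detector scores. The helper names `auc_score` and `fpr_at_tpr` and the example scores are hypothetical, and treating clean samples as the positive class is an assumption made for illustration, since the review does not state which class is positive.

```python
import math
from typing import Sequence


def auc_score(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def fpr_at_tpr(pos_scores: Sequence[float],
               neg_scores: Sequence[float],
               target_tpr: float = 0.95) -> float:
    """False positive rate at the highest threshold that accepts at least target_tpr of positives."""
    ranked = sorted(pos_scores, reverse=True)
    needed = max(1, math.ceil(target_tpr * len(ranked)))
    threshold = ranked[needed - 1]
    false_pos = sum(1 for n in neg_scores if n >= threshold)
    return false_pos / len(neg_scores)


# Made-up detector scores: positives are clean samples (assumption),
# negatives are adversarial samples.
clean_scores = [0.9, 0.8, 0.7, 0.6]
adv_scores = [0.65, 0.3, 0.2, 0.1]

print(auc_score(clean_scores, adv_scores))   # 0.9375
print(fpr_at_tpr(clean_scores, adv_scores))  # 0.25
```

The pairwise AUC is quadratic in the number of samples, which is fine for a small illustration; a library routine would be the practical choice on full test sets.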


This review was generated by AI and checked by human editors.