QUICK REVIEW

[Paper Review] The Limitations of Adversarial Training and the Blind-Spot Attack

Huan Zhang, Hongge Chen | arXiv (Cornell University) | Jan 15, 2019

Adversarial Robustness in Machine Learning | References: 52 | Cited by: 62

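One-Sentence Summary

This paper shows that the robustness gained from adversarial training depends strongly on the distance between a test point and the training data manifold, proposes the blind-spot attack, and exposes vulnerabilities on high-dimensional datasets even under strong defenses.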

ABSTRACT

The adversarial training procedure proposed by Madry et al. (2018) is one of the most effective methods to defend against adversarial examples in deep neural networks (DNNs). In our paper, we shed some lights on the practicality and the hardness of adversarial training by showing that the effectiveness (robustness on test set) of adversarial training has a strong correlation with the distance between a test point and the manifold of training data embedded by the network. Test examples that are relatively far away from this manifold are more likely to be vulnerable to adversarial attacks. Consequentially, an adversarial training based defense is susceptible to a new class of attacks, the "blind-spot attack", where the input images reside in "blind-spots" (low density regions) of the empirical distribution of training data but is still on the ground-truth data manifold. For MNIST, we found that these blind-spots can be easily found by simply scaling and shifting image pixel values. Most importantly, for large datasets with high dimensional and complex data manifold (CIFAR, ImageNet, etc), the existence of blind-spots in adversarial training makes defending on any valid test examples difficult due to the curse of dimensionality and the scarcity of training data. Additionally, we find that blind-spots also exist on provable defenses including (Wong & Kolter, 2018) and (Sinha et al., 2018) because these trainable robustness certificates can only be practically optimized on a limited set of training data.

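Motivation and Objectives

Measure how the effectiveness of adversarial training relates to the distance between a test point and the training data manifold.
Identify and define the blind-spot attack class, in which inputs lie in low-density regions yet still come from the true data distribution.
Demonstrate that blind spots exist in several strong defenses and that transformed inputs can expose the vulnerability.
Discuss the implications for scaling adversarial training to datasets with high intrinsic dimensionality.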

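Proposed Method

Proposes a distance metric in the deep embedding space: the average distance to the k nearest neighbors in the training set quantifies how far a test point lies from the training manifold.
After neural feature extraction, projects the embeddings with a nonlinear method (t-SNE) to compare the training and test distributions and to estimate the KL divergence between the empirical distributions.
Defines the blind-spot attack by applying scaling and shifting transformations to the input and then constructing small-distortion adversarial examples from the transformed images.
Evaluates robustness and attack success rate within a given epsilon bound on MNIST, Fashion-MNIST, and CIFAR-10, using the adversarial training of Madry et al. and the C&W attack.
Shows that robustness correlates with distance to the training data and that blind spots can also weaken certified defenses.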
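
The k-nearest-neighbor distance from the first item above can be illustrated with a minimal sketch. It assumes Euclidean distance and toy 2-D vectors standing in for network embeddings; the function name `knn_mean_distance`, the choice of k, and the sample data are hypothetical and are not taken from the paper.

```python
import math


def knn_mean_distance(query, train_embeddings, k=5):
    """Mean Euclidean distance from `query` to its k nearest training embeddings.

    Assumption: Euclidean distance is used here purely for illustration.
    """
    if not 1 <= k <= len(train_embeddings):
        raise ValueError("k must be between 1 and the number of training points")
    # Distance from the query to every training embedding, smallest first.
    distances = sorted(math.dist(query, t) for t in train_embeddings)
    return sum(distances[:k]) / k


# Toy 2-D "embeddings"; in practice these would be features from the network.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
near = knn_mean_distance((0.5, 0.5), train, k=2)  # point inside the training cloud
far = knn_mean_distance((5.0, 5.0), train, k=2)   # point far from the training cloud
print(near < far)  # True
```

A larger value suggests that the test point sits farther from the training data in embedding space, which is the quantity the review says correlates with vulnerability.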

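Experimental Results

Research Questions

RQ1: Does the robustness of adversarial training correlate with the distance from a test point to the training data manifold?
RQ2: Can inputs far from the training data (blind spots) still be classified correctly while being easily attacked with small distortions?
RQ3: Do blind spots exist in strong defenses, including certified defenses, and can simple transformations reveal them?
RQ4: How does high-dimensional data affect the scalability of adversarial training?
RQ5: How do simple input transformations affect robustness without harming natural accuracy?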

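Key Findings

On MNIST, Fashion-MNIST, and CIFAR-10, the effectiveness of adversarial training on test data correlates with the distance to the training manifold.
Blind-spot inputs lie in low-density regions of the empirical training distribution yet remain on the true data manifold, and they are easy to attack with small distortions.
Simple scaling and shifting transformations reveal blind spots in MNIST and Fashion-MNIST models while having little effect on natural accuracy.
Blind spots are common in strong defenses; their existence helps explain why robustness scales poorly to high-dimensional datasets such as CIFAR-10 and ImageNet.
CIFAR-10 shows a larger KL divergence between the training and test distributions, and a higher attack success rate on adversarially trained models, than MNIST or Fashion-MNIST.
Experiments show that small input perturbations can push original test images into blind spots, weakening robustness even when training accuracy remains high.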
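
The scaling and shifting transformation mentioned in the findings above can be sketched as an affine map x' = alpha * x + beta on pixel values. This is a sketch under stated assumptions: pixels are taken to lie in [0, 1], and the function name `scale_and_shift` and the alpha and beta values are hypothetical, chosen only so the arithmetic is easy to follow. The subsequent attack step (for example C&W) is not shown.

```python
def scale_and_shift(pixels, alpha, beta):
    """Apply x' = alpha * x + beta to a flat list of pixels in [0, 1].

    Assumption: the transformed image must stay a valid image, so any pixel
    leaving [0, 1] is treated as an error rather than silently clipped.
    """
    transformed = [alpha * p + beta for p in pixels]
    if any(p < 0.0 or p > 1.0 for p in transformed):
        raise ValueError("transformed pixels leave [0, 1]; choose another alpha/beta")
    return transformed


# Hypothetical values: compress the contrast and lift the dark pixels.
image = [0.0, 0.25, 0.5, 1.0]
candidate = scale_and_shift(image, alpha=0.5, beta=0.25)
print(candidate)  # [0.25, 0.375, 0.5, 0.75]
```

Raising an error instead of clipping is a design choice for this sketch: clipping would make the result a non-affine function of the input, so it would no longer be a pure scale-and-shift of the original image.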


This review was generated by AI and reviewed by a human editor.