QUICK REVIEW

[Paper Review] DARCCC: Detecting Adversaries by Reconstruction from Class Conditional Capsules

Nicholas Frosst, Sara Sabour|arXiv (Cornell University)|Nov 16, 2018

Adversarial Robustness in Machine Learning|References: 16|Cited by: 35

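ONE-SENTENCE SUMMARY

DARCCC detects adversarial images by measuring the L2 reconstruction error between the input image and the image reconstructed from the pose and identity of the winning top-level capsule in a capsule network. It identifies adversarial examples effectively on MNIST, Fashion-MNIST, and SVHN, with high detection rates even under white-box attacks, although a stronger reconstruction-aware attack (R-BIM) can evade detection by making the adversarial image look more like the target class.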

ABSTRACT

We present a simple technique that allows capsule models to detect adversarial images. In addition to being trained to classify images, the capsule model is trained to reconstruct the images from the pose parameters and identity of the correct top-level capsule. Adversarial images do not look like a typical member of the predicted class and they have much larger reconstruction errors when the reconstruction is produced from the top-level capsule for that class. We show that setting a threshold on the $l2$ distance between the input image and its reconstruction from the winning capsule is very effective at detecting adversarial images for three different datasets. The same technique works quite well for CNNs that have been trained to reconstruct the image from all or part of the last hidden layer before the softmax. We then explore a stronger, white-box attack that takes the reconstruction error into account. This attack is able to fool our detection technique but in order to make the model change its prediction to another class, the attack must typically make the "adversarial" image resemble images of the other class.
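
The detection rule described in the abstract can be written compactly. This is only a notational sketch: the symbols $x$ (input image), $\hat{y}$ (predicted class), $r(x, \hat{y})$ (reconstruction from the winning capsule), $d(x)$ (reconstruction distance), and $\theta$ (threshold) are labels introduced here for illustration and are not taken from the paper.

```latex
% Reconstruction distance for an input x with predicted class \hat{y}
d(x) = \lVert x - r(x, \hat{y}) \rVert_2
% Decision rule: flag x as adversarial when the distance exceeds the threshold
\text{flag}(x) =
\begin{cases}
  \text{adversarial} & \text{if } d(x) > \theta \\
  \text{clean}       & \text{otherwise}
\end{cases}
```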

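MOTIVATION AND OBJECTIVES

Develop an attack-agnostic adversarial detection method that does not rely on assumptions about the data manifold or the adversarial distribution.
Use the reconstruction sub-network of a capsule network to detect adversarial examples based on reconstruction fidelity.
Extend the detection technique to standard CNNs by training them to reconstruct the input from hidden features.
Evaluate detection performance against black-box and white-box adversarial attacks on multiple datasets.
Design a stronger white-box attack (R-BIM) that takes the reconstruction error into account in order to bypass DARCCC detection.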

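PROPOSED METHOD

Train a capsule network with a reconstruction head that reconstructs the input image from the pose parameters and identity of the correct top-level capsule; at detection time, the reconstruction is produced from the winning (predicted) capsule.
Use the L2 distance between the input image and its reconstruction as the metric for detecting adversarial examples.
Set a fixed threshold on the reconstruction error and flag any input whose error exceeds it as adversarial.
Extend the method to CNNs by training them to reconstruct the image from all or part of the last hidden layer before the softmax, using the same reconstruction-error metric.
Design a new attack, R-BIM, that jointly minimizes the classification loss and the reconstruction error in order to evade DARCCC detection.
Optimize iteratively with gradient steps that account for both misclassification and good reconstruction quality.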
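
The thresholding steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: images are assumed to be flat sequences of pixel values, the reconstructions are assumed to come from some already-trained model, and choosing the threshold as a percentile of clean validation distances (here the 95th, via the hypothetical helper `percentile_threshold`) is an assumption made for the example; the summary only states that a fixed threshold is used.

```python
import math
from typing import Sequence


def l2_distance(image: Sequence[float], reconstruction: Sequence[float]) -> float:
    """Euclidean (L2) distance between a flattened image and its reconstruction."""
    if len(image) != len(reconstruction):
        raise ValueError("image and reconstruction must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(image, reconstruction)))


def percentile_threshold(distances: Sequence[float], q: float = 95.0) -> float:
    """Nearest-rank percentile of clean validation distances.

    Assumption for this sketch: the threshold is picked so that roughly
    (100 - q)% of clean validation images would be flagged.
    """
    if not distances:
        raise ValueError("need at least one validation distance")
    ordered = sorted(distances)
    rank = max(1, math.ceil(q * len(ordered) / 100.0))
    return ordered[rank - 1]


def is_adversarial(image: Sequence[float],
                   reconstruction: Sequence[float],
                   threshold: float) -> bool:
    """Flag the input when its reconstruction error exceeds the fixed threshold."""
    return l2_distance(image, reconstruction) > threshold
```

In use, one would compute `l2_distance` for every clean validation image, pass the resulting list to `percentile_threshold` once, and then call `is_adversarial` on each incoming input with that fixed value.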

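EXPERIMENTAL RESULTS

Research questions

RQ1: Does the reconstruction error obtained from class-conditional capsule representations effectively detect adversarial examples across different datasets?
RQ2: How does DARCCC perform against black-box and white-box adversarial attacks, including FGSM and BIM?
RQ3: Does the detection method generalize to ordinary CNNs that reconstruct from hidden representations?
RQ4: How does a reconstruction-aware attack (R-BIM) affect DARCCC's detection performance?
RQ5: Are adversarial examples generated to minimize reconstruction error visually plausible, and do they resemble images of the target class?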

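Key findings

DARCCC achieves attack detection rates above 95% and successful-attack detection rates above 90% against both FGSM and BIM on MNIST, Fashion-MNIST, and SVHN.
Capsule models outperform CNNs in detection accuracy and, in particular, maintain strong detection performance on SVHN.
On simple datasets, reconstruction error correlates closely with semantic similarity, but this correlation weakens on complex datasets such as ImageNet or CIFAR-10.
The R-BIM attack evades DARCCC detection by generating adversarial images that look more like the target class, which makes them more visually plausible.
Although it evades detection, R-BIM is markedly less effective than standard BIM at changing the model's prediction, indicating a trade-off between evading detection and causing misclassification.
Adversarial examples crafted to minimize reconstruction error often look like realistic images of the target class, suggesting that they align with the data manifold.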
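
The trade-off between misclassification and low reconstruction error can be illustrated with a toy reconstruction-aware iterative attack. This is a hedged sketch only: it assumes the two objectives are combined as a weighted sum with a hypothetical weight `beta`, uses signed-gradient steps in the style of BIM, and projects onto an L-infinity budget `eps`; the paper's exact R-BIM update may differ. The callables `grad_cls` and `grad_rec` are placeholders for gradients that a real model would supply.

```python
from typing import Callable, List, Sequence

GradFn = Callable[[Sequence[float]], Sequence[float]]


def sign(v: float) -> int:
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def r_bim(x_orig: Sequence[float],
          grad_cls: GradFn,
          grad_rec: GradFn,
          eps: float = 0.3,
          alpha: float = 0.01,
          beta: float = 1.0,
          steps: int = 100) -> List[float]:
    """Iterative signed-gradient descent on cls_loss + beta * rec_loss.

    grad_cls: gradient of the (targeted) classification loss w.r.t. the input.
    grad_rec: gradient of the reconstruction error w.r.t. the input.
    Both are hypothetical placeholders supplied by the caller.
    """
    x = list(x_orig)
    for _ in range(steps):
        g_cls = grad_cls(x)
        g_rec = grad_rec(x)
        for i in range(len(x)):
            combined = g_cls[i] + beta * g_rec[i]
            # Step against the combined gradient to lower both losses.
            xi = x[i] - alpha * sign(combined)
            # Project back into the L-infinity ball around the original input.
            xi = min(max(xi, x_orig[i] - eps), x_orig[i] + eps)
            # Keep pixel values in the valid [0, 1] range.
            x[i] = min(max(xi, 0.0), 1.0)
    return x
```

With `beta = 0` this reduces to a plain targeted BIM-style loop; raising `beta` pushes the perturbation toward inputs that reconstruct well, which is consistent with the finding that such examples tend to resemble the target class.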


This review was generated by AI and reviewed by a human editor.