QUICK REVIEW

[Paper Review] SentiNet: Detecting Physical Attacks Against Deep Learning Systems

Edward Chou, Florian Tramèr | arXiv (Cornell University) | Dec 4, 2018

Adversarial Robustness in Machine Learning | References: 34 | Cited by: 84

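One-Sentence Summary

SentiNet is a novel, attack-agnostic detection framework that uses model interpretability and object detection techniques to identify localized universal adversarial attacks, such as physical adversarial patches and data poisoning, without prior knowledge of the attack or retraining of the model. It achieves competitive detection performance across multiple attack types and withstands adaptive adversaries who craft attacks specifically to evade the detection mechanism.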

ABSTRACT

SentiNet is a novel detection framework for localized universal attacks on neural networks. These attacks restrict adversarial noise to contiguous portions of an image and are reusable with different images -- constraints that prove useful for generating physically-realizable attacks. Unlike most other works on adversarial detection, SentiNet does not require training a model or preknowledge of an attack prior to detection. Our approach is appealing due to the large number of possible mechanisms and attack-vectors that an attack-specific defense would have to consider. By leveraging the neural network's susceptibility to attacks and by using techniques from model interpretability and object detection as detection mechanisms, SentiNet turns a weakness of a model into a strength. We demonstrate the effectiveness of SentiNet on three different attacks -- i.e., data poisoning attacks, trojaned networks, and adversarial patches (including physically realizable attacks) -- and show that our defense is able to achieve very competitive performance metrics for all three threats. Finally, we show that SentiNet is robust against strong adaptive adversaries, who build adversarial patches that specifically target the components of SentiNet's architecture.

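Motivation and Objectives

Address the challenge of detecting physically realizable, localized universal adversarial attacks on deep neural networks.
Develop a defense that requires neither prior knowledge of the attack nor retraining of the model.
Build a detection framework that is robust to adaptive adversaries who design attacks specifically to evade detection.
Generalize across multiple attack types, including adversarial patches, data poisoning, and backdoored models.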

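Proposed Method

SentiNet uses class activation mapping (CAM) to identify the salient regions of an input image that most influence the model's prediction.
It applies object detection techniques to localize suspicious high-activation regions that may correspond to adversarial perturbations.
The framework treats the network's attention as an indicator of a potential attack, turning the model's susceptibility to attacks into a detection signal.
It combines interpretability maps with object detection to detect localized adversarial noise that is reusable across different inputs.
The system is designed to be modular and attack-agnostic, so it does not depend on specific attack patterns or training data.
It is evaluated against adaptive adversaries who optimize their patches to evade SentiNet's detection components.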
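
The steps above can be illustrated with a small sketch. It is not the authors' implementation: the summary does not say how SentiNet computes its saliency maps or scores candidate regions, so everything below rests on stated assumptions. The function names (`class_activation_map`, `salient_box`, `transplant`, `fooled_fraction`) and the `fraction` threshold are hypothetical, the saliency grid is assumed to have the same resolution as the image, a plain threshold-and-bounding-box step stands in for the object detection stage, and the transplant test is one possible reading of "reusable across different inputs".

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[float]]          # a 2-D grid: one image channel or a saliency map
Box = Tuple[int, int, int, int]   # (top, left, bottom, right), all inclusive


def class_activation_map(feature_maps: List[Grid], class_weights: List[float]) -> Grid:
    """Classic CAM: cam[y][x] = sum over k of class_weights[k] * feature_maps[k][y][x]."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def salient_box(cam: Grid, fraction: float = 0.5) -> Optional[Box]:
    """Bounding box of all cells whose activation is >= fraction * peak.

    Assumption: a simple threshold replaces a real region-proposal step.
    Returns None when the map has no positive activation.
    """
    peak = max(max(row) for row in cam)
    if peak <= 0:
        return None
    cutoff = fraction * peak
    ys, xs = [], []
    for y, row in enumerate(cam):
        for x, value in enumerate(row):
            if value >= cutoff:
                ys.append(y)
                xs.append(x)
    return min(ys), min(xs), max(ys), max(xs)


def transplant(suspect: Grid, clean: Grid, box: Box) -> Grid:
    """Copy the boxed region of the suspect image onto a copy of a clean image."""
    top, left, bottom, right = box
    patched = [row[:] for row in clean]
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            patched[y][x] = suspect[y][x]
    return patched


def fooled_fraction(
    suspect: Grid,
    box: Box,
    clean_images: List[Grid],
    predict: Callable[[Grid], int],
) -> float:
    """Fraction of clean images pushed to the suspect's label by the region.

    Assumption: the clean images belong to classes other than the suspect's
    predicted label, so a match after transplanting counts as "fooled".
    """
    hijack_label = predict(suspect)
    fooled = sum(
        1 for clean in clean_images
        if predict(transplant(suspect, clean, box)) == hijack_label
    )
    return fooled / len(clean_images)
```

Under these assumptions, a region that drags most clean images to the same label behaves like a reusable localized attack, whereas the salient region of a benign image would be expected to transfer far less often. Any decision threshold on `fooled_fraction` would have to be calibrated on clean data and is not given in this summary.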

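Experimental Results

Research Questions

RQ1: Can a detection framework identify localized universal adversarial attacks without knowledge of the attack and without retraining the model?
RQ2: How effectively does SentiNet detect physically realizable adversarial patches across different models and datasets?
RQ3: How robust is SentiNet to adaptive adversaries who design attacks specifically to evade its detection mechanism?
RQ4: Can an interpretability-based detection method generalize across multiple attack types, including data poisoning and model backdoors?
RQ5: How does SentiNet compare with attack-specific defenses in detection accuracy and robustness?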

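Key Findings

SentiNet achieves competitive detection performance on three different attack types: adversarial patches, data poisoning, and backdoored models.
The framework detects physically realizable adversarial patches even when the patches are optimized to evade detection.
SentiNet remains robust against strong adaptive adversaries who specifically target its detection components.
Because the method needs neither retraining nor prior knowledge of the attack, it is broadly applicable and suited to practical deployment.
By leveraging model interpretability and object detection, SentiNet turns the model's vulnerability into a detection advantage.
The method achieves high detection accuracy without relying on attack-specific signatures or training data.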

This review was generated by AI and reviewed by human editors.