QUICK REVIEW

[Paper Digest] Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples.

Nicolas Papernot, Patrick McDaniel|arXiv (Cornell University)|Feb 8, 2016

Topic: Adversarial Robustness in Machine Learning | References: 31 | Citations: 300

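One-Sentence Summary

This paper presents a practical black-box attack on deep learning systems based on adversarial examples: the attacker trains a substitute model using only the target model's outputs on chosen queries, then uses it to mount evasion attacks without any access to the target's architecture or parameters. Against a DNN hosted on MetaMind's production API, the attack reached a misclassification rate of 84.24%, showing that adversarial examples transfer effectively between models.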

ABSTRACT

Advances in deep learning have led to the broad adoption of Deep Neural Networks (DNNs) to a range of important machine learning problems, e.g., guiding autonomous vehicles, speech recognition, malware detection. Yet, machine learning models, including DNNs, were shown to be vulnerable to adversarial samples: subtly (and often humanly indistinguishably) modified malicious inputs crafted to compromise the integrity of their outputs. Adversarial examples thus enable adversaries to manipulate system behaviors. Potential attacks include attempts to control the behavior of vehicles, have spam content identified as legitimate content, or have malware identified as legitimate software. Adversarial examples are known to transfer from one model to another, even if the second model has a different architecture or was trained on a different set. We introduce the first practical demonstration that this cross-model transfer phenomenon enables attackers to control a remotely hosted DNN with no access to the model, its parameters, or its training data. In our demonstration, we only assume that the adversary can observe outputs from the target DNN given inputs chosen by the adversary. We introduce the attack strategy of fitting a substitute model to the input-output pairs in this manner, then crafting adversarial examples based on this auxiliary model. We evaluate the approach on existing DNN datasets and real-world settings. In one experiment, we force a DNN supported by MetaMind (one of the online APIs for DNN classifiers) to mis-classify inputs at a rate of 84.24%. We conclude with experiments exploring why adversarial samples transfer between DNNs, and a discussion on the applicability of our attack when targeting machine learning algorithms distinct from DNNs.

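Research Motivation and Objectives

Show that adversarial examples can be crafted against a remotely hosted deep neural network without access to its architecture, parameters, or training data.
Investigate whether adversarial examples crafted on a substitute model transfer to the black-box target model.
Evaluate the attack's effectiveness in real-world settings, including a production API.
Explore the underlying reasons why adversarial examples transfer between different deep learning models.
Assess how far the attack applies to machine learning models other than deep neural networks.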

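Proposed Method

The attacker queries the target DNN with inputs of its own choosing and records the corresponding outputs.
The collected input-output pairs are used to train a substitute model that imitates the target's behavior, so adversarial examples can be generated without knowledge of the target's internals.
Adversarial examples are crafted on the substitute model from its gradients, using a standard gradient-based technique such as the fast gradient sign method (FGSM).
The crafted examples are then submitted to the target model to measure how often they cause misclassification.
The attack relies on the observed transferability of adversarial examples between models that differ in architecture or training data.
The approach is evaluated on standard DNN datasets and on a real-world API, MetaMind's online DNN classifier service.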
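
The steps above can be sketched end to end on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the "remote" target is a hypothetical hidden linear classifier (`oracle_predict`), the substitute is a plain softmax regression rather than a DNN, the queries are random Gaussian points, and `eps` is an arbitrary perturbation size chosen for the toy data. The names `oracle_predict`, `train_substitute`, and `fgsm` are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the remote target. The attacker code below never
# reads _w_hidden directly; it only calls oracle_predict().
_w_hidden = rng.normal(size=(10, 3))  # 10 input features, 3 classes

def oracle_predict(x):
    """Black-box oracle: returns predicted labels only, no gradients or scores."""
    return np.argmax(x @ _w_hidden, axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_substitute(x, y, n_classes, lr=0.5, epochs=300):
    """Fit a softmax-regression substitute on oracle-labelled queries."""
    w = np.zeros((x.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        # Gradient of the mean cross-entropy loss with respect to the weights.
        grad_w = x.T @ (softmax(x @ w) - onehot) / len(x)
        w -= lr * grad_w
    return w

def fgsm(x, y, w, eps):
    """Fast gradient sign method on the substitute:
    x_adv = x + eps * sign(d loss / d x), with cross-entropy loss."""
    onehot = np.eye(w.shape[1])[y]
    grad_x = (softmax(x @ w) - onehot) @ w.T
    return x + eps * np.sign(grad_x)

# Step 1: query the oracle on attacker-chosen inputs.
x_query = rng.normal(size=(500, 10))
y_query = oracle_predict(x_query)

# Step 2: train the substitute on the collected input-output pairs.
w_sub = train_substitute(x_query, y_query, n_classes=3)

# Step 3: craft adversarial examples using only the substitute's gradients.
x_test = rng.normal(size=(200, 10))
y_test = oracle_predict(x_test)
x_adv = fgsm(x_test, y_test, w_sub, eps=1.0)

# Step 4: submit them to the oracle and measure how often its label changes.
agreement = np.mean(np.argmax(x_test @ w_sub, axis=1) == y_test)
transfer_rate = np.mean(oracle_predict(x_adv) != y_test)
print(f"substitute/oracle agreement on clean inputs: {agreement:.2f}")
print(f"fraction of adversarial inputs misclassified by the oracle: {transfer_rate:.2f}")
```

The perturbation is computed without ever touching the oracle's weights, which is the point of the substitute: a model that only agrees with the target on labels can still supply gradients that point across the target's decision boundary. The numbers printed by this toy say nothing about the results reported in the paper.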

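Experimental Results

Research Questions

RQ1: Can adversarial examples be crafted effectively against a black-box DNN when only output queries are available?
RQ2: How accurately does the substitute model reproduce the target DNN's behavior, and is that enough to support a successful attack?
RQ3: In real-world settings, how well do adversarial examples crafted on the substitute transfer to the actual target model?
RQ4: Which factors drive the transferability of adversarial examples across DNN architectures and training data?
RQ5: Does the attack strategy generalize to machine learning models other than deep neural networks?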

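Key Findings

The attack reached a misclassification rate of 84.24% against MetaMind's online DNN classifier, showing that it is practical in a real-world black-box setting.
The substitute model imitated the target's behavior closely enough for the crafted adversarial examples to transfer at a high success rate.
Adversarial examples crafted on the substitute transferred effectively to the target, confirming that the transferability phenomenon holds in practice.
The approach remained effective when the target had a different architecture or was trained on different data, which indicates that transferability is robust to these differences.
The attack succeeded without access to the target's parameters or training data, demonstrating that it is feasible in a fully black-box scenario.
The results suggest that transferability of adversarial examples is a systemic weakness of deep learning systems, one that persists even when the model is deployed remotely and tightly secured.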
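
To make the headline metric concrete, here is a small sketch of how a misclassification rate of this kind can be computed. It assumes the rate is the fraction of adversarial inputs for which the target's label differs from the label of the corresponding unmodified input; the helper `misclassification_rate` and the toy labels are invented for illustration and are not data from the paper.

```python
def misclassification_rate(original_labels, labels_on_adversarial):
    """Fraction of inputs whose label changes once the input is perturbed."""
    if len(original_labels) != len(labels_on_adversarial):
        raise ValueError("label lists must have the same length")
    if not original_labels:
        raise ValueError("at least one label is required")
    changed = sum(o != a for o, a in zip(original_labels, labels_on_adversarial))
    return changed / len(original_labels)

# Toy labels (hypothetical): 6 of the 8 perturbed inputs receive a new label.
toy_original = [3, 1, 4, 1, 5, 9, 2, 6]
toy_on_adversarial = [3, 7, 0, 2, 5, 8, 1, 0]
rate = misclassification_rate(toy_original, toy_on_adversarial)
print(f"{rate:.2%}")
```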

This digest was generated by AI and reviewed by a human editor.