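[Paper Explained] Towards Deep Learning Models Resistant to Adversarial Attacks
This paper frames adversarial robustness as a robust optimization (min-max) problem, trains high-capacity networks with PGD-based adversarial training, and demonstrates strong robustness against a broad set of attacks on MNIST and CIFAR-10.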
Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. Code and pre-trained models are available at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge.
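Motivation and Objectives
- Explain why deep networks are vulnerable to adversarial examples and establish a principled robustness objective.
- Formulate adversarial robustness as a saddle-point (min-max) optimization problem that couples an inner adversarial attack with an outer training objective.
- Study the optimization landscape of the inner attack and the role of network capacity in robustness.
- Develop and evaluate a training method that makes models robust to a wide range of adversarial attacks.
- Provide a challenging benchmark and invite the community to attack it in order to evaluate robustness.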
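Proposed Method
- Adopts a robust optimization framework: minimize over the parameters theta the expected adversarial loss rho(theta) = E_{(x,y)~D}[ max_{delta in S} L(theta, x + delta, y) ].
- Treats PGD (projected gradient descent) as a universal first-order adversary for the inner maximization when S is an ℓ∞ ball.
- Performs adversarial training by solving the outer minimization with SGD on adversarially perturbed inputs.
- Uses the intuition behind Danskin's theorem to justify treating the gradient of the loss at an inner maximizer as a descent direction for the saddle-point problem.
- Probes the loss landscape of the inner maximization by running PGD from many random starting points, and analyzes how tightly the resulting adversarial maxima concentrate.
- Explores the effect of network capacity on robustness by scaling up model size and training against strong adversaries.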
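The inner maximization and outer minimization described above can be sketched on a toy problem. The block below is a minimal, dependency-free illustration and not the authors' implementation: it assumes a linear logistic-regression model with labels in {-1, +1} instead of a deep network, and every function name (`pgd_linf`, `robust_loss`, `adversarial_sgd_step`) and hyperparameter value is hypothetical. The PGD update it uses is the standard signed-gradient step followed by projection onto the ℓ∞ ball; clipping to a valid pixel range, which image models would also need, is omitted.

```python
import math
import random


def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def logistic_loss(w, b, x, y):
    """Loss log(1 + exp(-y * (w.x + b))) for a label y in {-1, +1}."""
    return math.log1p(math.exp(-y * (dot(w, x) + b)))


def loss_coeff(w, b, x, y):
    """d loss / d score. Input gradient is coeff * w; weight gradient is coeff * x."""
    return -y / (1.0 + math.exp(y * (dot(w, x) + b)))


def pgd_linf(w, b, x, y, eps, step, n_steps, rng):
    """Approximate maximizer of the loss over the l_inf ball of radius eps around x."""
    # Random start inside the ball.
    delta = [rng.uniform(-eps, eps) for _ in x]
    for _ in range(n_steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        coeff = loss_coeff(w, b, x_adv, y)
        # Ascent step along the sign of the input gradient, then projection
        # onto the ball (for l_inf this is coordinate-wise clipping).
        delta = [
            max(-eps, min(eps, di + step * sign(coeff * wi)))
            for di, wi in zip(delta, w)
        ]
    return [xi + di for xi, di in zip(x, delta)]


def robust_loss(w, b, data, eps, step, n_steps, rng):
    """Average loss at PGD adversarial points (an estimate of rho(theta))."""
    total = 0.0
    for x, y in data:
        x_adv = pgd_linf(w, b, x, y, eps, step, n_steps, rng)
        total += logistic_loss(w, b, x_adv, y)
    return total / len(data)


def adversarial_sgd_step(w, b, batch, eps, step, n_steps, lr, rng):
    """One outer-minimization step using gradients taken at the adversarial inputs."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in batch:
        x_adv = pgd_linf(w, b, x, y, eps, step, n_steps, rng)
        coeff = loss_coeff(w, b, x_adv, y)
        grad_w = [g + coeff * xi for g, xi in zip(grad_w, x_adv)]
        grad_b += coeff
    n = len(batch)
    new_w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return new_w, b - lr * grad_b / n


if __name__ == "__main__":
    rng = random.Random(0)
    data = [([2.0, 1.0], 1), ([-2.0, -1.0], -1)]  # toy, linearly separable
    w, b = [0.1, 0.1], 0.0
    for _ in range(20):
        w, b = adversarial_sgd_step(w, b, data, 0.5, 0.125, 10, 0.5, rng)
    print("robust loss after training:", robust_loss(w, b, data, 0.5, 0.125, 10, rng))
```

For a linear model the inner maximum has a closed form (each coordinate moves by eps against the sign of y times the weight), so PGD recovers it exactly once the steps can cover the width of the ball. Deep networks have no such closed form, which is why the paper examines the inner problem empirically with many random restarts.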
Experimental Results
Research Questions
- RQ1: Can first-order adversaries like PGD reliably solve the inner maximization in the robust optimization formulation for deep networks?
- RQ2: Does increasing network capacity improve robustness to adversarial attacks, and how does FGSM training compare to PGD training?
- RQ3: How does adversarial training against PGD affect transferability of adversarial examples across models and architectures?
- RQ4: Is robustness against PGD a good proxy for robustness against a broader class of first-order adversaries and certain black-box attacks?
- RQ5: What are the practical accuracies achievable on MNIST and CIFAR-10 under a broad suite of adversarial attacks?
Key Findings
- The inner adversarial optimization landscape is tractable for first-order methods and exhibits concentration of maxima across restarts.
- Model capacity significantly improves robustness; larger networks survive stronger adversaries and show reduced transferability of adversarial inputs.
- Adversarial training with PGD yields strong robustness on MNIST and CIFAR-10, with MNIST achieving over 89% accuracy against strong adversaries and CIFAR-10 around 46% under the same strong white-box attacks.
- Under weaker black-box/transfer attacks, MNIST and CIFAR-10 models achieve over 95% and 64% accuracy, respectively.
- FGSM-based training can overfit (label leaking) and often fails to withstand PGD attacks, whereas PGD training provides better resistance to strong iterative attacks.
This explanation was generated by AI and reviewed by a human editor.