QUICK REVIEW

[Paper Review] Towards Deep Learning Models Resistant to Adversarial Attacks

Aleksander Mądry, Aleksandar Makelov | arXiv (Cornell University) | Jun 19, 2017

Adversarial Robustness in Machine Learning | 30 references | 1,538 citations

TL;DR

The paper frames adversarial robustness as a robust optimization (minimax) problem, uses PGD-based adversarial training to train high-capacity networks, and demonstrates strong robustness on MNIST and CIFAR-10 against a broad set of attacks.

ABSTRACT

Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. Code and pre-trained models are available at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge.

Motivation & Objective

Address why deep networks are vulnerable to adversarial examples and establish a principled robustness goal.
Formulate adversarial robustness as a saddle-point (minimax) optimization problem that combines an inner adversarial attack (maximization) with an outer training objective (minimization).
Investigate the optimization landscape of the inner attack and the role of network capacity in robustness.
Develop and evaluate a training methodology that yields models robust to a wide range of adversarial attacks.
Provide a challenging benchmark and invite community attacks to evaluate robustness.

Proposed method

Adopt a robust optimization framework: minimize over parameters theta the expected adversarial loss rho(theta) = E_{(x,y)~D}[ max_{delta in S} L(theta, x + delta, y) ], where S is the set of allowed perturbations.
Treat PGD (projected gradient descent) as a universal first-order adversary for the inner maximization when S is an ℓ∞ ball.
Use adversarial training by solving the outer minimization with SGD on adversarially perturbed inputs.
Invoke Danskin’s theorem to justify using the loss gradient evaluated at an inner maximizer as a descent direction for the saddle-point objective.
Investigate the loss landscape of inner maximization via multi-start PGD and analyze concentration of adversarial maxima.
Explore the impact of network capacity on robustness by scaling model size and evaluating against strong adversaries.
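
The inner and outer steps listed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it swaps the deep network for a logistic regression model so that the gradients can be written by hand, uses only the Python standard library, and runs on a made-up two-cluster dataset. All function names, the toy data, and the values of eps, step, iters, and lr are assumptions chosen for the example.

```python
import math
import random


def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x):
    # Probability of class 1 under a linear (logistic regression) model.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss(w, b, x, y):
    # Binary cross-entropy L(theta, x, y) with label y in {0, 1}.
    p = min(max(predict(w, b, x), 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def pgd_linf(w, b, x, y, eps, step, iters, rng):
    # Inner maximization: random start in the l-infinity ball of radius eps,
    # then signed-gradient ascent steps, each followed by projection
    # (clipping) back onto the ball.
    delta = [rng.uniform(-eps, eps) for _ in x]
    for _ in range(iters):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        err = predict(w, b, x_adv) - y      # dL/dz for cross-entropy
        grad = [err * wi for wi in w]       # dL/dx = (p - y) * w
        delta = [min(max(di + step * sign(gi), -eps), eps)
                 for di, gi in zip(delta, grad)]
    return [xi + di for xi, di in zip(x, delta)]


def adversarial_train(data, eps, step, iters, lr, epochs, seed=0):
    # Outer minimization: SGD where each parameter gradient is evaluated at
    # the PGD-perturbed input rather than at the clean input.
    rng = random.Random(seed)
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):   # shuffled copy
            x_adv = pgd_linf(w, b, x, y, eps, step, iters, rng)
            err = predict(w, b, x_adv) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x_adv)]
            b -= lr * err
    return w, b


def robust_accuracy(w, b, data, eps, step, iters, seed=1):
    # Fraction of points still classified correctly after a PGD attack.
    rng = random.Random(seed)
    hits = 0
    for x, y in data:
        x_adv = pgd_linf(w, b, x, y, eps, step, iters, rng)
        hits += int((predict(w, b, x_adv) >= 0.5) == (y == 1))
    return hits / len(data)


def make_toy_data(n_per_class, seed=0):
    # Hypothetical toy data: two well-separated clusters in 2-D.
    rng = random.Random(seed)
    data = []
    for _ in range(n_per_class):
        data.append(([2 + rng.uniform(-0.5, 0.5), 2 + rng.uniform(-0.5, 0.5)], 1))
        data.append(([-2 + rng.uniform(-0.5, 0.5), -2 + rng.uniform(-0.5, 0.5)], 0))
    return data


data = make_toy_data(20)
w, b = adversarial_train(data, eps=0.5, step=0.1, iters=12, lr=0.1, epochs=5)
print("robust accuracy:", robust_accuracy(w, b, data, eps=0.5, step=0.1, iters=12))
```

For a linear model the inner maximum is reached at a corner of the ℓ∞ ball, so a single signed step would already be enough here; the iterative loop is kept only to mirror the structure that matters for deep networks, where the loss is not monotone in each input coordinate.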

Experimental results

Research questions

RQ1: Can first-order adversaries like PGD reliably solve the inner maximization in the robust optimization formulation for deep networks?
RQ2: Does increasing network capacity improve robustness to adversarial attacks, and how does FGSM training compare to PGD training?
RQ3: How does adversarial training against PGD affect transferability of adversarial examples across models and architectures?
RQ4: Is robustness against PGD a good proxy for robustness against a broader class of first-order adversaries and certain black-box attacks?
RQ5: What are the practical accuracies achievable on MNIST and CIFAR-10 under a broad suite of adversarial attacks?

Key findings

The inner adversarial optimization landscape is tractable for first-order methods and exhibits concentration of maxima across restarts.
Model capacity significantly improves robustness; larger networks survive stronger adversaries and show reduced transferability of adversarial inputs.
Adversarial training with PGD yields strong robustness on both datasets: the MNIST model retains over 89% accuracy against the strongest white-box adversaries tested, and the CIFAR-10 model about 46% against the same adversary.
Under weaker black-box/transfer attacks, MNIST and CIFAR-10 models achieve over 95% and 64% accuracy, respectively.
Training on FGSM examples can overfit to the single-step attack (the label leaking effect) and often fails to withstand PGD, whereas PGD training provides better resistance to strong iterative attacks.
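
The first finding above, that the final loss values reached by PGD concentrate across random restarts, comes from an experiment that can be sketched as follows. This is an illustrative stand-in rather than the paper's setup: the model is a tiny tanh network with hypothetical hand-picked weights (W1, B1, W2, B2), the gradient is derived by hand, and the input, eps, step size, and restart count are arbitrary choices. It shows only how such an experiment is wired, and makes no claim about the spread observed on real networks.

```python
import math
import random

# Hypothetical fixed weights for a tiny 2-input, 3-hidden-unit tanh network.
W1 = [[1.5, -2.0], [-1.0, 0.5], [0.7, 1.2]]
B1 = [0.1, -0.3, 0.2]
W2 = [1.0, -1.5, 0.8]
B2 = 0.05


def forward(x):
    # Returns the hidden activations and the probability of class 1.
    h = [math.tanh(sum(wji * xi for wji, xi in zip(row, x)) + bj)
         for row, bj in zip(W1, B1)]
    z = sum(vj * hj for vj, hj in zip(W2, h)) + B2
    return h, 1.0 / (1.0 + math.exp(-z))


def loss(x, y):
    # Binary cross-entropy for label y in {0, 1}.
    _, p = forward(x)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def grad_x(x, y):
    # Backpropagate dL/dx through the output unit and the tanh layer.
    h, p = forward(x)
    da = [(p - y) * vj * (1.0 - hj * hj) for vj, hj in zip(W2, h)]
    return [sum(da[j] * W1[j][i] for j in range(len(W1)))
            for i in range(len(x))]


def pgd_once(x, y, eps, step, iters, rng):
    # One PGD run: random start in the l-infinity ball, signed-gradient
    # ascent, and clipping back onto the ball after every step.
    delta = [rng.uniform(-eps, eps) for _ in x]
    for _ in range(iters):
        g = grad_x([xi + di for xi, di in zip(x, delta)], y)
        delta = [min(max(di + step * ((gi > 0) - (gi < 0)), -eps), eps)
                 for di, gi in zip(delta, g)]
    return [xi + di for xi, di in zip(x, delta)]


def multi_start_pgd(x, y, eps, step, iters, restarts, seed=0):
    # Repeat PGD from independent random starts and record each final loss.
    rng = random.Random(seed)
    finals = [pgd_once(x, y, eps, step, iters, rng) for _ in range(restarts)]
    return finals, [loss(xa, y) for xa in finals]


x0, y0 = [0.2, -0.1], 1
points, values = multi_start_pgd(x0, y0, eps=0.3, step=0.02, iters=40, restarts=20)
print("clean loss: %.4f" % loss(x0, y0))
print("final losses: min=%.4f max=%.4f" % (min(values), max(values)))
```

A narrow gap between the minimum and maximum final loss is the kind of evidence the review refers to as concentration of maxima; a wide gap would suggest that single-start PGD is an unreliable estimate of the inner maximum.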

This review was created by AI and reviewed by human editors.