[Paper Review] Wasserstein Adversarial Examples via Projected Sinkhorn Iterations
Introduces Wasserstein distance as a threat model for adversarial attacks on images and develops a fast projected Sinkhorn-based method to generate Wasserstein adversarial examples, plus adversarial training and analysis of robustness.
A rapidly growing area of work has studied the existence of adversarial examples, datapoints which have been perturbed to fool a classifier, but the vast majority of these works have focused primarily on threat models defined by $\ell_p$ norm-bounded perturbations. In this paper, we propose a new threat model for adversarial attacks based on the Wasserstein distance. In the image classification setting, such distances measure the cost of moving pixel mass, which naturally cover "standard" image manipulations such as scaling, rotation, translation, and distortion (and can potentially be applied to other settings as well). To generate Wasserstein adversarial examples, we develop a procedure for projecting onto the Wasserstein ball, based upon a modified version of the Sinkhorn iteration. The resulting algorithm can successfully attack image classification models, bringing traditional CIFAR10 models down to 3% accuracy within a Wasserstein ball with radius 0.1 (i.e., moving 10% of the image mass 1 pixel), and we demonstrate that PGD-based adversarial training can improve this adversarial accuracy to 76%. In total, this work opens up a new direction of study in adversarial robustness, more formally considering convex metrics that accurately capture the invariances that we typically believe should exist in classifiers. Code for all experiments in the paper is available at https://github.com/locuslab/projected_sinkhorn.
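The radius interpretation in the abstract ("moving 10% of the image mass 1 pixel") can be checked with a small worked example. The sketch below is not from the paper: it assumes a toy 1-D "image" of four pixels normalized to sum to 1, a ground cost equal to the distance between pixel positions, and hypothetical names `x`, `C`, and `P`. It builds one explicit transport plan and evaluates its cost, which upper-bounds the Wasserstein distance between the original and the perturbed image.

```python
import numpy as np

# Toy 1-D "image": four pixel intensities normalized to sum to 1 (illustrative values).
x = np.array([0.4, 0.3, 0.2, 0.1])

# Ground cost: distance between pixel positions, |i - j| on a line.
idx = np.arange(len(x))
C = np.abs(idx[:, None] - idx[None, :]).astype(float)

# Transport plan: all mass stays in place except 0.1 units moved from pixel 0 to pixel 1.
P = np.diag(x)
P[0, 0] -= 0.1
P[0, 1] += 0.1

y = P.sum(axis=0)             # perturbed image (column marginal of the plan)
cost = float((P * C).sum())   # transport cost <P, C> of this plan

print(y)     # approximately [0.3, 0.4, 0.2, 0.1]
print(cost)  # approximately 0.1
```

Because the plan moves 0.1 units of mass over a distance of one pixel, its cost is 0.1, so the perturbed image `y` lies within a Wasserstein ball of radius 0.1 around `x` under this assumed ground cost.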
Motivation & Objective
- Move beyond $\ell_p$-norm threat models by using the Wasserstein distance to capture natural image transformations such as scaling, rotation, translation, and distortion.
- Develop a fast, approximate projection onto a Wasserstein ball to enable iterative adversarial attacks.
- Demonstrate attack effectiveness on standard models and show improvements via Wasserstein-focused adversarial training.
- Explore compatibility and limitations of Wasserstein attacks with existing provable defenses and certificates.
Proposed method
- Formulate Wasserstein-ball projection as an entropy-regularized optimization to enable a Sinkhorn-like algorithm.
- Derive a dual formulation with auxiliary variables ($\alpha$, $\beta$, $\psi$) and obtain practical update rules.
- Provide a projected Sinkhorn iteration (Algorithm 2) to compute the Wasserstein-ball projection efficiently.
- Introduce local transport plans that limit mass movement to a $k \times k$ neighborhood, reducing complexity to $O(nk^2)$ for an image with $n$ pixels.
- Embed the projection into a PGD-style adversarial attack and into adversarial training (Algorithm 1).
- Analyze compatibility with duality-based certificates and discuss fundamental gaps for provable robustness under Wasserstein perturbations.
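For readers unfamiliar with the Sinkhorn iteration that the method builds on, here is a minimal sketch of the standard entropy-regularized version for optimal transport between two fixed histograms. It is not the paper's projected variant, which additionally optimizes over the target histogram and enforces the cost budget through a third dual variable. The function name `sinkhorn`, the parameter `reg`, and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def sinkhorn(a, b, C, reg=1.0, n_iters=200):
    """Standard Sinkhorn: entropy-regularized transport plan between histograms a and b.

    Illustrative only; this is NOT the projected Sinkhorn iteration from the paper.
    """
    K = np.exp(-C / reg)            # Gibbs kernel derived from the ground cost
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)             # rescale rows to match the source marginal
        v = b / (K.T @ u)           # rescale columns to match the target marginal
    return u[:, None] * K * v[None, :]

# Toy histograms over four pixel positions with cost |i - j| (assumed values).
a = np.array([0.4, 0.3, 0.2, 0.1])
b = np.array([0.3, 0.4, 0.2, 0.1])
idx = np.arange(4)
C = np.abs(idx[:, None] - idx[None, :]).astype(float)

P = sinkhorn(a, b, C)
print(P.sum(axis=1))   # close to a
print(P.sum(axis=0))   # close to b
```

The alternating rescaling of rows and columns is what makes the approach cheap: each iteration needs only matrix-vector products with the kernel `K`. Restricting transport to a local neighborhood, as the method above does, makes that kernel sparse.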
Experimental results
Research questions
- RQ1: Can Wasserstein distance serve as a natural, structure-preserving perturbation model for adversarial examples beyond $\ell_p$ norms?
- RQ2: How can one efficiently project onto a Wasserstein ball to enable iterative adversarial attacks and training?
- RQ3: Do Wasserstein-based adversarial examples reveal different robustness properties compared to traditional perturbations, and can adversarial training mitigate them?
- RQ4: Are existing certifiable robustness methods compatible with Wasserstein perturbations, and what are their limitations?
- RQ5: What is the empirical impact of Wasserstein attacks on standard and provably robust models on MNIST and CIFAR-10?
Key findings
- Wasserstein perturbations produce structured adversarial changes that reflect image content, unlike typical $\ell_p$ perturbations.
- A fast approximate Wasserstein projection using projected Sinkhorn iterations enables effective PGD-like attacks within Wasserstein balls.
- Adversarial training under Wasserstein perturbations significantly improves adversarial accuracy (e.g., CIFAR-10: from 3% to 76% under attack).
- Models provably robust to $\ell_\infty$ perturbations show some transfer of robustness to Wasserstein attacks, but are not fully robust.
- Existing certifiable defenses based on interval bounds have fundamental limitations for Wasserstein perturbations, indicating need for new verification approaches.
- On CIFAR-10, the Wasserstein attack is highly effective against standard models (e.g., 97% attack success at $\epsilon = 0.1$).
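The PGD-like attack referred to in these findings follows the usual pattern of taking a gradient ascent step on the loss and then projecting back onto the allowed perturbation set. The sketch below illustrates only that pattern, under loud assumptions: a toy linear classifier with logistic loss and an analytic gradient, and an $\ell_2$-ball projection as a stand-in. In the paper's attack the projection would instead be the projected Sinkhorn step onto the Wasserstein ball. All function names and values are hypothetical.

```python
import numpy as np

def project_l2(z, center, radius):
    """Stand-in projection onto an l2 ball (NOT the Wasserstein projection)."""
    delta = z - center
    norm = np.linalg.norm(delta)
    if norm <= radius:
        return z
    return center + delta * (radius / norm)

def loss_and_grad(x, w, b, label):
    """Logistic loss of a linear classifier and its gradient with respect to x."""
    margin = label * (w @ x + b)              # label is -1 or +1
    loss = np.log1p(np.exp(-margin))
    grad = -label * w / (1.0 + np.exp(margin))
    return loss, grad

def pgd_attack(x0, w, b, label, radius, step=0.05, n_steps=20, project=project_l2):
    """Projected gradient ascent on the loss, staying within `radius` of x0."""
    x = x0.copy()
    for _ in range(n_steps):
        _, g = loss_and_grad(x, w, b, label)
        x = x + step * g / (np.linalg.norm(g) + 1e-12)   # normalized ascent step
        x = project(x, x0, radius)                       # return to the feasible set
    return x

# Toy classifier and input (assumed values).
w = np.array([1.0, -2.0, 0.5])
b = 0.1
x0 = np.array([0.5, 0.1, 0.4])

x_adv = pgd_attack(x0, w, b, label=+1, radius=0.1)
loss_clean, _ = loss_and_grad(x0, w, b, +1)
loss_adv, _ = loss_and_grad(x_adv, w, b, +1)
print(loss_clean, loss_adv)   # the adversarial loss should be the larger of the two
```

Keeping the projection as a pluggable `project` argument mirrors the design described above: the attack loop does not change across threat models, and only the projection step differs.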
This review was created by AI and reviewed by human editors.