[Paper Review] Better Safe Than Sorry: Preventing Delusive Adversaries with Adversarial Training
This paper proposes adversarial training as a principled defense against delusive attacks—malicious training-time perturbations that degrade model accuracy without mislabeling. By formalizing delusive attacks within an ∞-Wasserstein ball, the authors show that minimizing adversarial risk on perturbed data optimizes an upper bound of natural risk on clean data, enabling adversarial training to recover performance lost to delusive adversaries across multiple benchmarks and attack types.
Delusive attacks aim to substantially deteriorate the test accuracy of the learning model by slightly perturbing the features of correctly labeled training examples. By formalizing this malicious attack as finding the worst-case training data within a specific $\infty$-Wasserstein ball, we show that minimizing adversarial risk on the perturbed data is equivalent to optimizing an upper bound of natural risk on the original data. This implies that adversarial training can serve as a principled defense against delusive attacks. Thus, the test accuracy decreased by delusive attacks can be largely recovered by adversarial training. To further understand the internal mechanism of the defense, we disclose that adversarial training can resist the delusive perturbations by preventing the learner from overly relying on non-robust features in a natural setting. Finally, we complement our theoretical findings with a set of experiments on popular benchmark datasets, which show that the defense withstands six different practical attacks. Both theoretical and empirical results vote for adversarial training when confronted with delusive adversaries.
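The upper-bound claim in the abstract can be written out as a short sketch. The notation is an assumption made for illustration and may differ from the paper's: $\mathcal{D}$ is the clean distribution, $\widehat{\mathcal{D}}$ the delusively perturbed one, $\ell$ the loss, $f$ the model, and $\epsilon$ the perturbation budget, with the same norm used for the Wasserstein cost and the adversarial ball.

```latex
\begin{align*}
\mathcal{R}_{\mathrm{nat}}(f, \mathcal{D})
  &= \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f(x), y)\big], \\
\mathcal{R}_{\mathrm{adv}}(f, \widehat{\mathcal{D}})
  &= \mathbb{E}_{(x,y)\sim\widehat{\mathcal{D}}}\Big[\max_{\|x'-x\|\le\epsilon}\ell(f(x'), y)\Big], \\
\mathrm{W}_\infty(\mathcal{D}, \widehat{\mathcal{D}}) \le \epsilon
  \;&\Longrightarrow\;
  \mathcal{R}_{\mathrm{nat}}(f, \mathcal{D}) \le \mathcal{R}_{\mathrm{adv}}(f, \widehat{\mathcal{D}}).
\end{align*}
```

Informally, if every clean example lies within distance $\epsilon$ of its perturbed, identically labeled counterpart, then the inner maximization around the perturbed point also covers the clean point, so the adversarial risk on the perturbed data cannot be smaller than the natural risk on the clean data.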
Motivation & Objective
- Address the growing threat of delusive attacks, where adversaries subtly perturb correctly labeled training data to degrade model generalization.
- Overcome limitations of standard data cleaning and detection methods, which fail when perturbed examples are correctly labeled and abundant.
- Demonstrate that adversarial training can defend against delusive attacks without discarding perturbed examples, preserving data utility.
- Reveal the internal mechanism by which adversarial training prevents over-reliance on non-robust, brittle features introduced by delusive adversaries.
- Validate the defense empirically against six diverse practical attacks on CIFAR-10, SVHN, and ImageNet subsets across supervised and self-supervised learning tasks.
Proposed method
- Formalize delusive attacks as finding the worst-case training data within an ∞-Wasserstein ball that preserves labels, modeling the most harmful perturbations.
- Prove that minimizing adversarial risk on the perturbed data is equivalent to optimizing an upper bound of the natural risk on the original data.
- Use this equivalence to justify adversarial training as a principled defense mechanism against delusive adversaries.
- Analyze the defense mechanism by studying two perturbation directions: adversarial (P1, P3) and hypocritical (P2, P4), showing adversarial training resists both via distinct mechanisms.
- Introduce five practical attack variants: P1 (adversarial), P2 (hypocritical), P3 (universal adversarial), P4 (universal hypocritical), and P5 (universal random perturbations) for empirical evaluation.
- Apply standard adversarial training (e.g., PGD-based) on datasets poisoned by these attacks to evaluate robustness and generalization on clean test sets.
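The last step above, PGD-based adversarial training, can be sketched on a toy problem. This is a minimal illustration and not the authors' implementation: it assumes a linear logistic-regression model in NumPy rather than a deep network, an L-infinity budget `eps`, and hypothetical function names and hyperparameters chosen only for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad_x(w, b, x, y):
    """Gradient of the logistic loss with respect to the inputs (labels in {0, 1})."""
    p = sigmoid(x @ w + b)
    return (p - y)[:, None] * w[None, :]

def pgd_linf(w, b, x, y, eps, step, n_steps):
    """Projected gradient ascent on the loss inside an L-inf ball of radius eps."""
    x_adv = x.copy()
    for _ in range(n_steps):
        g = loss_grad_x(w, b, x_adv, y)
        x_adv = x_adv + step * np.sign(g)          # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project back into the ball
    return x_adv

def adversarial_train(x, y, eps=0.1, lr=0.5, epochs=200, pgd_steps=5):
    """Each epoch: craft worst-case inputs, then take a gradient step on them."""
    n, d = x.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        x_adv = pgd_linf(w, b, x, y, eps, step=eps / 2, n_steps=pgd_steps)
        p = sigmoid(x_adv @ w + b)
        w -= lr * (x_adv.T @ (p - y)) / n
        b -= lr * np.mean(p - y)
    return w, b
```

In the setting the review describes, `x` would be the (possibly poisoned) training set, and the resulting model would be evaluated on clean test data. The inner loop never changes labels, matching the premise that delusive perturbations leave labels correct.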
Experimental results
Research questions
- RQ1: Can adversarial training effectively defend against delusive attacks that perturb correctly labeled training data without mislabeling?
- RQ2: Is there a theoretical justification for why adversarial training improves natural accuracy under delusive poisoning?
- RQ3: How does adversarial training mitigate the negative impact of non-robust features introduced by delusive adversaries?
- RQ4: Does the defense remain effective across diverse attack types, including universal and random perturbations?
- RQ5: Can adversarial training recover performance degraded by delusive attacks in real-world scenarios with untrusted data sources?
Key findings
- Adversarial training on delusively poisoned data recovers natural test accuracy that would otherwise be severely degraded, even when all training examples are perturbed.
- Theoretical analysis shows that minimizing adversarial risk on poisoned data optimizes an upper bound of natural risk on clean data, justifying the defense mechanism.
- Adversarial training prevents models from over-relying on non-robust, brittle features introduced by delusive attacks, improving generalization.
- The defense is robust against six distinct practical attacks, including universal adversarial and hypocritical perturbations, across CIFAR-10, SVHN, and ImageNet subsets.
- Even the simple P5 attack using class-specific random perturbations proves surprisingly effective, yet adversarial training successfully mitigates its impact.
- Empirical results confirm that adversarial training is not only effective against test-time adversarial examples but also a powerful defense against the more insidious delusive training-time attacks.
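To make the P5 finding concrete, a class-specific random perturbation can be sketched as follows. This is an illustrative assumption, not the paper's exact procedure: the function name, the uniform sampling in `[-eps, eps]`, and the `[0, 1]` pixel range are all hypothetical choices for the example.

```python
import numpy as np

def class_random_perturb(x, y, eps, n_classes, seed=0):
    """Add one fixed random pattern per class to every example of that class.

    x: float array of shape (n, ...) with values assumed to lie in [0, 1]
    y: integer class labels of shape (n,); labels are left unchanged
    """
    rng = np.random.default_rng(seed)
    # One perturbation per class, bounded in L-inf norm by eps (assumed sampling scheme).
    deltas = rng.uniform(-eps, eps, size=(n_classes,) + x.shape[1:])
    # Index by label so each example receives its class pattern, then keep a valid range.
    x_pert = np.clip(x + deltas[y], 0.0, 1.0)
    return x_pert, deltas
```

Because the pattern depends only on the label, it is perfectly predictive of the class on the training set while carrying no signal at test time, which gives one intuition for why such a simple perturbation can mislead a standard learner.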
This review was created by AI and reviewed by human editors.