[Paper Review] A Survey of Black-Box Adversarial Attacks on Computer Vision Models
This survey provides a comprehensive comparative analysis of black-box adversarial attacks and defense techniques in computer vision, categorizing attack methods by query efficiency, perturbation type, and threat model. It identifies that query-efficient attacks like ZOO and Bandit methods achieve high success rates with minimal queries, while defenses such as pixel deflection and randomization show strong generalization and robustness on ImageNet, though most defenses remain vulnerable to adaptive attacks.
Machine learning has seen tremendous advances in the past few years, which has led to deep learning models being deployed in varied applications of day-to-day life. Attacks on such models using perturbations, particularly in real-life scenarios, pose a severe challenge to their applicability, pushing research toward enhancing the robustness of these models. After the introduction of these perturbations by Szegedy et al. [1], a significant amount of research has focused on the reliability of such models, primarily in two settings: white-box, where the adversary has access to the targeted model and related parameters; and black-box, which resembles a real-life scenario with the adversary having almost no knowledge of the model to be attacked. To provide comprehensive security coverage, it is essential to identify, study, and build defenses against such attacks. Hence, in this paper, we present a comprehensive comparative study of various black-box adversarial attacks and defense techniques.
Motivation & Objective
- To provide a systematic taxonomy of black-box adversarial attacks in computer vision, distinguishing them from white-box threats.
- To analyze and compare the effectiveness of various black-box attack strategies based on query efficiency, perturbation type, and threat model constraints.
- To evaluate existing defense mechanisms against black-box attacks, focusing on their robustness, accuracy retention, and generalization across datasets like MNIST, CIFAR-10, and ImageNet.
- To highlight the gap in defense evaluation, where most techniques are tested only against white-box attacks, not real-world black-box scenarios.
- To identify future research directions, including detection of non-robust features and targeted exploitation for adversarial misclassification.
Proposed method
- Categorizes black-box attacks into query-based, gradient estimation, and transfer-based methods, with emphasis on query efficiency and perturbation constraints.
- Classifies attacks by threat model components: attacker goals (e.g., targeted, integrity) and capabilities (e.g., query limit, model access).
- Evaluates defense techniques using metrics like attack success rate, classification accuracy with/without defense, and robustness under different perturbation norms (L2, Linf).
- Compares defense methods such as adversarial training, distillation, MagNet, pixel deflection, and randomization across MNIST, CIFAR-10, and ImageNet datasets.
- Employs standardized benchmarks: FGSM, PGD, C&W, DeepFool, and JSMA attacks with fixed hyperparameters (e.g., ϵ=8 for Linf, ϵ=0.03 for L2) for fair comparison.
- Analyzes performance using attack success rate and accuracy drop, with data drawn from published results in cited works (e.g., Xu, Guo, Prakash, Xie, etc.).
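To make the gradient-estimation category above more concrete, here is a minimal sketch of a zeroth-order attack that only queries a loss value and never touches model internals. It is an illustration of the general idea (symmetric finite differences followed by a signed step projected onto an Linf ball), not the implementation of any specific attack covered by the survey. The names `toy_loss`, `TOY_WEIGHTS`, `estimate_gradient`, and `linf_attack`, along with all parameter values, are assumptions made for this example; the linear "model" stands in for a real classifier's loss.

```python
from typing import Callable, List, Tuple

Vector = List[float]

# Hypothetical black-box: the attacker can only call toy_loss(x) and read a number.
TOY_WEIGHTS: Vector = [0.5, -2.0, 1.0]


def toy_loss(x: Vector) -> float:
    """Stand-in for a model's loss; a real attack would query a deployed classifier."""
    return sum(w * xi for w, xi in zip(TOY_WEIGHTS, x))


def sign(v: float) -> int:
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def estimate_gradient(query_loss: Callable[[Vector], float],
                      x: Vector, h: float = 1e-4) -> Vector:
    """Estimate the gradient coordinate by coordinate with symmetric differences.

    Costs two queries per input dimension, which is why query efficiency
    matters so much for high-dimensional images.
    """
    grad: Vector = []
    for i in range(len(x)):
        plus, minus = list(x), list(x)
        plus[i] += h
        minus[i] -= h
        grad.append((query_loss(plus) - query_loss(minus)) / (2.0 * h))
    return grad


def linf_attack(query_loss: Callable[[Vector], float], x0: Vector,
                eps: float, step: float, iters: int) -> Tuple[Vector, int]:
    """Ascend the loss with signed steps, staying within eps of x0 in Linf norm."""
    x = list(x0)
    queries = 0
    for _ in range(iters):
        grad = estimate_gradient(query_loss, x)
        queries += 2 * len(x)
        # Take a signed step, then clamp each coordinate back into [x0 - eps, x0 + eps].
        x = [min(max(xi + step * sign(gi), oi - eps), oi + eps)
             for xi, gi, oi in zip(x, grad, x0)]
    return x, queries
```

Calling `linf_attack(toy_loss, [0.1, 0.2, 0.3], eps=0.05, step=0.02, iters=5)` returns the perturbed input together with the number of queries spent. Because the query count grows linearly with input dimension, a sketch like this would be impractical on full-size images without reducing the number of coordinates estimated per step.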
Experimental results
Research questions
- RQ1: How do different black-box attack strategies compare in terms of query efficiency and attack success rate across standard datasets?
- RQ2: What is the relative robustness of defense mechanisms like adversarial training, distillation, and pixel deflection against diverse black-box attack types?
- RQ3: Why do most existing defenses fail when evaluated under realistic black-box threat models despite strong white-box performance?
- RQ4: To what extent do defense techniques preserve model accuracy while increasing robustness to adversarial perturbations?
- RQ5: Can non-robust features be systematically identified and exploited to design more efficient or stealthier black-box attacks?
Key findings
- Pixel deflection by Prakash et al. is reported with 100% classification accuracy without defense and a 9.7% attack success rate on CIFAR-10, demonstrating high generalization and robustness.
- On ImageNet, the average attack success rate across all defenses was significantly higher than on MNIST or CIFAR-10, indicating greater vulnerability in larger-scale datasets.
- Defenses like MagNet and Xu’s median smoothing reduced attack success rate to 0% for FGSM and I-FGSM on CIFAR-10, but only under specific perturbation norms.
- The defense by Xie et al. (2018) achieved 98.9% accuracy without defense and an 18.5% attack success rate against FGSM, showing a strong balance between accuracy and robustness.
- Most defenses showed reduced performance when tested against adaptive black-box attacks, indicating a critical gap in real-world applicability.
- The study identifies that current defenses are predominantly evaluated on white-box attacks, suggesting a need for more rigorous black-box evaluation in future work.
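The findings above are stated in terms of classification accuracy and attack success rate. As a rough sketch of how such figures can be computed from predictions, the snippet below uses one common convention: attack success rate is measured only over examples the model classified correctly before the attack. This convention is an assumption here; individual papers define the metric differently, which is one reason cross-paper comparisons need care. The function names are hypothetical.

```python
from typing import Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(clean_preds: Sequence[int],
                        adv_preds: Sequence[int],
                        labels: Sequence[int]) -> float:
    """Fraction of initially correct examples that the attack turns into errors.

    Assumes an untargeted attack; examples misclassified before the attack
    are excluded so they do not count as attack successes.
    """
    eligible = [(a, y) for c, a, y in zip(clean_preds, adv_preds, labels) if c == y]
    if not eligible:
        return 0.0
    return sum(a != y for a, y in eligible) / len(eligible)


def accuracy_drop(clean_preds: Sequence[int],
                  adv_preds: Sequence[int],
                  labels: Sequence[int]) -> float:
    """Clean accuracy minus accuracy on the adversarial inputs."""
    return accuracy(clean_preds, labels) - accuracy(adv_preds, labels)
```

For example, with labels `[0, 1, 2, 3]`, clean predictions `[0, 1, 2, 0]`, and adversarial predictions `[0, 2, 2, 0]`, three examples are eligible and one of them is flipped, so the attack success rate is one third while the accuracy drop is 0.25.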
This review was created by AI and reviewed by human editors.