QUICK REVIEW

[Paper Review] Blind Backdoors in Deep Learning Models

Eugene Bagdasaryan, Vitaly Shmatikov | arXiv (Cornell University) | May 8, 2020

Adversarial Robustness in Machine Learning | 91 references | 46 citations

TL;DR

The paper introduces blind code-poisoning backdoors that modify loss computation during training to inject backdoors without access to data, model, or outputs, enabling powerful attacks across vision and language tasks and evading defenses.

ABSTRACT

We investigate a new method for injecting backdoors into machine learning models, based on compromising the loss-value computation in the model-training code. We use it to demonstrate new classes of backdoors strictly more powerful than those in the prior literature: single-pixel and physical backdoors in ImageNet models, backdoors that switch the model to a covert, privacy-violating task, and backdoors that do not require inference-time input modifications. Our attack is blind: the attacker cannot modify the training data, nor observe the execution of his code, nor access the resulting model. The attack code creates poisoned training inputs "on the fly," as the model is training, and uses multi-objective optimization to achieve high accuracy on both the main and backdoor tasks. We show how a blind attack can evade any known defense and propose new ones.

Motivation & Objective

Motivate and formalize a new backdoor vector: code poisoning via loss-value computation in ML pipelines.
Show that a blind attacker can inject versatile backdoors without data/model access.
Demonstrate backdoors that extend beyond simple pixel triggers to semantic and non-inference-time threats.
Analyze defenses and propose countermeasures, including certified robustness and trusted computational graphs.

Proposed method

Model backdoors are treated as multi-task learning where the model must satisfy both the main task and a backdoor task.
Attack code synthesizes backdoor inputs on the fly and computes a blind loss $\ell_{blind}$ that combines the main-task loss and the backdoor-task loss, using the Multiple Gradient Descent Algorithm (MGDA) to balance the conflicting objectives.
MGDA with the Frank-Wolfe optimizer determines the task-weighting coefficients automatically at runtime.
Backdoor triggers can be pixel patterns, single pixels, physical objects, or semantic features that do not require inference-time input modification.
Attack overhead is managed by attacking only near convergence and reusing MGDA-derived coefficients to minimize extra passes.
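
The loss combination described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes only two objectives (main and backdoor), for which the min-norm problem solved by MGDA has a closed form, and it represents gradients as flat Python lists. The function names, the top-left single-pixel trigger, and the nested-list image format are all hypothetical choices made for illustration.

```python
def synthesize_backdoor_batch(images, labels, target_label, trigger_value=1.0):
    """Copy a batch, stamp a single-pixel trigger, and relabel it.

    Hypothetical format: each image is a list of rows of floats.
    The trigger position (top-left pixel) is an illustrative assumption.
    """
    poisoned = [[row[:] for row in img] for img in images]
    for img in poisoned:
        img[0][0] = trigger_value
    return poisoned, [target_label] * len(labels)


def mgda_two_task_alpha(g_main, g_backdoor):
    """Weight a in [0, 1] minimizing ||a*g_main + (1-a)*g_backdoor||^2.

    Closed-form solution of the two-objective min-norm problem; with more
    than two objectives an iterative Frank-Wolfe solver would be needed.
    """
    dot_mm = sum(x * x for x in g_main)
    dot_bb = sum(y * y for y in g_backdoor)
    dot_mb = sum(x * y for x, y in zip(g_main, g_backdoor))
    denom = dot_mm - 2.0 * dot_mb + dot_bb  # ||g_main - g_backdoor||^2
    if denom == 0.0:
        return 0.5  # identical gradients: any weighting gives the same result
    alpha = (dot_bb - dot_mb) / denom
    return min(1.0, max(0.0, alpha))  # clip to the valid range


def blind_loss(loss_main, loss_backdoor, alpha):
    """Weighted sum of the main-task and backdoor-task losses."""
    return alpha * loss_main + (1.0 - alpha) * loss_backdoor
```

In a real training loop the two gradient vectors would come from separate backward passes on the clean batch and the synthesized batch, which is the source of the extra overhead the method tries to limit.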

Experimental results

Research questions

RQ1: Can a blind attacker modify the loss computation during training to embed backdoors without access to training data, code execution outputs, or the resulting model?
RQ2: What classes of backdoors can be achieved with blind code poisoning (e.g., pixel, physical, semantic, and non-inference-time triggers) and how effective are they against defenses?
RQ3: How does treating backdoor injection as multi-task learning and using MGDA affect the balance between main-task accuracy and backdoor functionality?
RQ4: What is the practical overhead of blind loss modification, and how can it be mitigated while preserving attack efficacy?
RQ5: What defenses remain effective against blind backdoors, and what new defenses do the authors propose?

Key findings

A blind attack can achieve high backdoor accuracy (99%) across diverse triggers and tasks while preserving main-task accuracy to a large extent.
On ImageNet, full training yields 65.3% main-task accuracy with or without a backdoor; in the other reported ImageNet configurations, main-task accuracy with a backdoor is 68.7–68.9% depending on the trigger, a slight drop from the corresponding clean baseline, while backdoor accuracy reaches ~99%.
Multiple backdoors on an MNIST-derived task (MultiMNIST) retain ~96% main-task accuracy, with backdoor tasks achieving ~95% accuracy (sum or multiply) when the trigger is present.
Semantic backdoors in NLP (IMDb sentiment) maintain 91% main-task accuracy and reach ~98% backdoor accuracy without input modification.
MGDA-based balancing (automatic loss coefficient optimization) yields higher backdoor success and main-task performance than fixed coefficients or batch poisoning (e.g., MGDA accuracy: 96.04% main, 95.47% multiply, 95.17% sum).
The attack increases training time and memory usage due to extra forward/backward passes, but overhead can be mitigated by targeting convergence, reusing coefficients, and dynamic convergence detection.
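
The overhead mitigations in the last finding can be sketched as a small gate around the attack code. This is an illustrative assumption rather than the paper's mechanism: it treats "near convergence" as a moving average of the main-task loss falling below a threshold, and it reuses a cached MGDA coefficient for a fixed number of steps. The class name `AttackGate` and the parameters `threshold`, `window`, and `refresh_every` are hypothetical.

```python
from collections import deque


class AttackGate:
    """Decide when to run the attack and when to recompute coefficients."""

    def __init__(self, threshold, window=3, refresh_every=10):
        self.threshold = threshold          # hypothetical convergence threshold
        self.recent = deque(maxlen=window)  # last `window` main-task losses
        self.refresh_every = refresh_every  # steps between coefficient updates
        self.cached_alpha = None
        self.steps_since_refresh = 0

    def should_attack(self, main_loss):
        """True once the moving average of the main loss is below threshold."""
        self.recent.append(main_loss)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and sum(self.recent) / len(self.recent) < self.threshold

    def get_alpha(self, compute_alpha):
        """Return a cached coefficient, recomputing only every few steps."""
        stale = self.steps_since_refresh >= self.refresh_every
        if self.cached_alpha is None or stale:
            self.cached_alpha = compute_alpha()  # the expensive extra passes
            self.steps_since_refresh = 0
        self.steps_since_refresh += 1
        return self.cached_alpha
```

While `should_attack` returns False, training proceeds with the unmodified loss, so the extra forward and backward passes are paid only late in training.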

This review was created by AI and reviewed by human editors.