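[Paper Digest] Sleeper Agent: Scalable Hidden Trigger Backdoors for Neural Networks Trained from Scratch
Sleeper Agent is a scalable hidden-trigger backdoor attack that works against neural networks trained from scratch. It combines gradient alignment, data selection, and adaptive retraining, and it remains effective in black-box settings and on large datasets such as ImageNet.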
As the curation of data for machine learning becomes increasingly automated, dataset tampering is a mounting threat. Backdoor attackers tamper with training data to embed a vulnerability in models that are trained on that data. This vulnerability is then activated at inference time by placing a "trigger" into the model's input. Typical backdoor attacks insert the trigger directly into the training data, although the presence of such an attack may be visible upon inspection. In contrast, the Hidden Trigger Backdoor Attack achieves poisoning without placing a trigger into the training data at all. However, this hidden trigger attack is ineffective at poisoning neural networks trained from scratch. We develop a new hidden trigger attack, Sleeper Agent, which employs gradient matching, data selection, and target model re-training during the crafting process. Sleeper Agent is the first hidden trigger backdoor attack to be effective against neural networks trained from scratch. We demonstrate its effectiveness on ImageNet and in black-box settings. Our implementation code can be found at https://github.com/hsouri/Sleeper-Agent.
Research Motivation and Goals
- Motivate defenses against data-curation threats as automated data collection scales up.
- Develop a hidden trigger backdoor attack that remains effective when the victim model is trained from scratch.
- Demonstrate robustness in black-box settings and across diverse architectures and datasets.
- Show how gradient alignment, targeted data selection, and periodic retraining boost attack success.
Proposed Method
- Formulate a bilevel poisoning objective under ℓ∞-norm constraints with a trigger patch p.
- Use gradient alignment to approximate the inner optimization by aligning training and adversarial gradients (Equation 4).
- Select high-impact poisons by gradient norm and optionally perform model retraining during poison crafting.
- Craft poisons on a surrogate or ensemble to enable black-box transfer to unknown victim architectures.
- Implement patch-agnostic data poisoning that perturbs only a small fraction M of the training data.
- Evaluate with retraining steps and differentiable data augmentation to improve stability.
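The three core operations listed above (gradient alignment, selection by gradient norm, and the ℓ∞ constraint) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the digest says only that training and adversarial gradients are aligned, so the cosine-similarity form of the loss is an assumption borrowed from common gradient-matching practice, and the function names `alignment_loss`, `select_poisons`, and `project_linf` are hypothetical. Gradients are represented as flat Python lists to keep the example dependency-free.

```python
import math


def alignment_loss(grad_train, grad_adv):
    """Return 1 - cosine similarity of two flattened gradient vectors.

    Assumed form of the alignment objective: 0 means the gradients point
    the same way, 2 means they point in opposite directions.
    """
    dot = sum(a * b for a, b in zip(grad_train, grad_adv))
    norm_t = math.sqrt(sum(a * a for a in grad_train))
    norm_a = math.sqrt(sum(b * b for b in grad_adv))
    if norm_t == 0.0 or norm_a == 0.0:
        return 1.0  # no direction to align with; treat as orthogonal
    return 1.0 - dot / (norm_t * norm_a)


def select_poisons(grad_norms, budget):
    """Return indices of the `budget` samples with the largest gradient norm."""
    ranked = sorted(range(len(grad_norms)),
                    key=lambda i: grad_norms[i], reverse=True)
    return ranked[:budget]


def project_linf(delta, eps):
    """Clip every entry of a perturbation into [-eps, eps]."""
    return [max(-eps, min(eps, d)) for d in delta]


# Hypothetical toy values, for illustration only.
print(alignment_loss([1.0, 0.0], [0.0, 1.0]))   # orthogonal gradients
print(select_poisons([0.1, 2.0, 0.5, 1.5], 2))  # two highest-norm samples
print(project_linf([0.5, -0.5, 0.01], 0.1))     # perturbation after clipping
```

In a full crafting loop, one would presumably minimize the alignment loss with respect to the perturbations on the selected samples and re-apply the projection after each update step; the retraining schedule and optimizer are outside the scope of this sketch.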
Experimental Results
Research Questions
- RQ1: Can hidden trigger backdoors be reliably injected into networks trained from scratch under realistic threat models?
- RQ2: How do gradient alignment, data selection, and retraining affect poisoning efficacy in black-box and ensemble settings?
- RQ3: What are Sleeper Agent's comparative strengths, and how do defenses fare against it, on standard benchmarks (CIFAR-10, ImageNet)?
Key Findings
- Sleeper Agent achieves high attack success across architectures and datasets, e.g., 85.27% on CIFAR-10 with ResNet-18 (1% poison budget).
- On CIFAR-10, poisoning just 1% of the training data yields attack success rates of up to 85.27%, inducing targeted misclassification when the patch is present.
- On ImageNet, with 0.05% poisoning budget, ResNet-18 and MobileNet-V2 show attack success rates of 44.00% and 41.00%, respectively.
- Ensembling (multiple copies of the same architecture) boosts transferability and attack success, e.g., S=4, T=4 reaches 88.45% on CIFAR-10.
- In black-box transfer, Sleeper Agent remains effective across architectures, averaging 58.44% under certain ensemble configurations.
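To make the percentages above concrete, the following sketch shows how a poison budget translates into an image count and how an attack success rate could be computed. The digest does not define the metric, so the definition used here (the share of patched source-class inputs that the victim model assigns to the target class) is an assumption, and the helper names `poison_count` and `attack_success_rate` are hypothetical. The 50,000 figure is the size of the standard CIFAR-10 training set.

```python
def poison_count(num_train, budget_percent):
    """Number of training images an attacker may perturb under a percentage budget."""
    return int(num_train * budget_percent / 100)


def attack_success_rate(predictions, target_class):
    """Percentage of patched source-class inputs predicted as the target class.

    Assumed metric definition; `predictions` holds the victim model's
    predicted labels for source-class images with the trigger patch applied.
    """
    if not predictions:
        raise ValueError("predictions must be non-empty")
    hits = sum(1 for p in predictions if p == target_class)
    return 100.0 * hits / len(predictions)


# A 1% budget on CIFAR-10's 50,000 training images.
print(poison_count(50_000, 1))                   # 500
# Hypothetical predictions for four patched images, target class 3.
print(attack_success_rate([3, 3, 1, 3], 3))      # 75.0
```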
This digest was generated by AI and reviewed by a human editor.