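[Paper Summary] Invisible Backdoor Attacks on Deep Neural Networks via Steganography and Regularization
This paper proposes two invisible backdoor attack methods, steganography-based trigger embedding and regularization-based trigger generation, and evaluates their effectiveness and stealthiness on multiple datasets using new perceptual metrics.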
Deep neural networks (DNNs) have been proven vulnerable to backdoor attacks, in which hidden features (patterns) trained into a normal model, and activated only by specific inputs (called triggers), trick the model into producing unexpected behavior. In this paper, we create covert and scattered triggers for backdoor attacks, called invisible backdoors, whose triggers can fool both DNN models and human inspection. We apply our invisible backdoors through two state-of-the-art methods of embedding triggers for backdoor attacks. The first approach, built on BadNets, embeds the trigger into DNNs through steganography. The second approach, built on a trojan attack, uses two types of additional regularization terms to generate triggers with irregular shape and size. We use the Attack Success Rate and Functionality to measure the performance of our attacks. We introduce two novel definitions of invisibility for human perception: one is conceptualized by the Perceptual Adversarial Similarity Score (PASS) and the other by Learned Perceptual Image Patch Similarity (LPIPS). We show that the proposed invisible backdoors can be fairly effective across various DNN models and four datasets (MNIST, CIFAR-10, CIFAR-100, and GTSRB) by measuring their attack success rates for the adversary, functionality for normal users, and invisibility scores for administrators. Finally, we argue that the proposed invisible backdoor attacks can effectively thwart state-of-the-art trojan backdoor detection approaches such as Neural Cleanse and TABOR.
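Motivation and Objectives
- Motivate the work by highlighting the practicality of backdoor attacks in machine learning as a service (MLaaS) and the need for invisible triggers.
- Propose two invisible backdoor methods: steganography-based trigger embedding and regularization-based trigger generation.
- Formalize backdoor attack generation as a bilevel optimization framework.
- Define and use human-perception-based invisibility metrics (PASS and LPIPS) to measure stealthiness.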
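The bilevel framing mentioned above can be made concrete with a generic template. This is only a sketch of what a bilevel backdoor objective typically looks like, not the paper's exact notation. Every symbol here is an assumption introduced for illustration: $\delta$ is the trigger, $T(x,\delta)$ stamps the trigger onto input $x$, $t$ is the attacker's target label, $f_\theta$ is the model, $\ell$ is a classification loss, $\Omega$ is an invisibility penalty with weight $\lambda$, and $D_{\text{clean}}$ and $D_{\text{poison}}$ are the clean and poisoned training sets.

```latex
\begin{aligned}
\min_{\delta}\quad
  & \mathbb{E}_{x}\Big[\ell\big(f_{\theta^{*}(\delta)}(T(x,\delta)),\, t\big)\Big]
    + \lambda\,\Omega(\delta) \\
\text{s.t.}\quad
  & \theta^{*}(\delta) \in \arg\min_{\theta}\;
    \sum_{(x,y)\in D_{\text{clean}}} \ell\big(f_{\theta}(x),\, y\big)
    + \sum_{x\in D_{\text{poison}}} \ell\big(f_{\theta}(T(x,\delta)),\, t\big)
\end{aligned}
```

In this reading, the inner problem is ordinary retraining on mixed clean and poisoned data. The outer problem picks a trigger that the retrained model responds to and that stays hard to perceive.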
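Proposed Methods
- Model backdoor creation as a bilevel optimization problem that achieves a high attack success rate on poisoned data while preserving functionality on clean data.
- Attack 1 embeds the trigger into the training data via least significant bit (LSB) steganography to achieve stealthiness.
- Attack 2 generates the trigger under Lp-norm regularization, which scatters the trigger and minimizes its visual detectability while maximizing neuron activation.
- Target a pretrained model and retrain it on the poisoned data to inject the backdoor.
- Evaluate stealthiness with PASS and LPIPS, and functionality with standard backdoor metrics.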
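The LSB embedding step used by Attack 1 can be illustrated with a minimal NumPy sketch. It assumes 8-bit images and a text trigger written into the lowest bit of the first pixels in row-major order. The function names (`text_to_bits`, `embed_lsb`, `extract_lsb`) and the choice of which pixels carry the bits are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def text_to_bits(text: str) -> np.ndarray:
    """Encode a string as a flat array of 0/1 values (uint8), MSB first."""
    data = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    return np.unpackbits(data)


def embed_lsb(image: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Return a copy of `image` whose first len(bits) values carry `bits`
    in their least significant bit. Assumption: uint8 image, row-major order."""
    if image.dtype != np.uint8:
        raise TypeError("this sketch expects an 8-bit (uint8) image")
    flat = image.reshape(-1).copy()  # copy so the input image is untouched
    if bits.size > flat.size:
        raise ValueError("trigger has more bits than the image has values")
    # Clear the lowest bit, then write the trigger bit into it.
    flat[: bits.size] = (flat[: bits.size] & 0xFE) | bits.astype(np.uint8)
    return flat.reshape(image.shape)


def extract_lsb(image: np.ndarray, n_bits: int) -> np.ndarray:
    """Read back the first n_bits least significant bits."""
    return image.reshape(-1)[:n_bits] & 1
```

Because only the lowest bit of each affected value changes, every pixel value differs from the original by at most 1, which is why such a trigger is hard to see. A poisoned sample would then be something like `embed_lsb(x, text_to_bits("trigger"))` paired with the attacker's target label.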
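Experimental Results
Research Questions
- RQ1: Can invisible triggers be embedded into DNN inputs so that they are imperceptible to humans yet still activate the backdoor?
- RQ2: Do steganography-based and regularization-based triggers remain effective across multiple datasets and model architectures?
- RQ3: How do the proposed invisibility metrics relate to attack success rate and normal model performance?
- RQ4: To what extent can invisible backdoors evade state-of-the-art backdoor defenses such as Neural Cleanse and TABOR?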
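Key Findings
- Invisible backdoors achieve high attack success rates while preserving model functionality on clean data.
- Steganographic triggers embedded via LSB exhibit a trade-off among trigger size, stealthiness (PASS/LPIPS), and the number of retraining epochs required.
- Regularization-based triggers produce small perturbations that activate specific neurons, yielding an effective backdoor with limited data and training.
- The proposed invisibility metrics, PASS and LPIPS, provide a quantifiable measure of how imperceptible backdoor triggers are to humans.
- Invisible backdoors may defeat detection by defenses such as Neural Cleanse and TABOR.
- Experiments on the MNIST, CIFAR-10, CIFAR-100, and GTSRB datasets demonstrate the effectiveness of these methods.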
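The two performance measures named in the abstract, Attack Success Rate and Functionality, can be computed from model predictions roughly as follows. This is a sketch under the common reading that attack success rate is the fraction of triggered inputs classified as the target label and functionality is the backdoored model's accuracy on clean inputs. The function names are illustrative, and the sketch does not exclude samples whose true label already equals the target.

```python
import numpy as np


def attack_success_rate(pred_on_triggered, target_label) -> float:
    """Fraction of triggered inputs that the model assigns to the target label."""
    pred = np.asarray(pred_on_triggered)
    return float(np.mean(pred == target_label))


def functionality(pred_on_clean, true_labels) -> float:
    """Accuracy of the (backdoored) model on clean, untriggered inputs."""
    pred = np.asarray(pred_on_clean)
    labels = np.asarray(true_labels)
    return float(np.mean(pred == labels))
```

An attack is typically considered successful when the first value is high and the second stays close to the accuracy of the original, unpoisoned model.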
This summary was generated by AI and reviewed by human editors.