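[Paper Explained] Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients
This paper adds input gradient regularization to training so that deep neural networks become more robust to adversarial perturbations and more interpretable. Gradient-regularized models resist transferred attacks and yield explanations that are more plausible and better aligned with human perception.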
Deep neural networks have proven remarkably effective at solving many classification problems, but have been criticized recently for two major weaknesses: the reasons behind their predictions are uninterpretable, and the predictions themselves can often be fooled by small adversarial perturbations. These problems pose major obstacles for the adoption of neural networks in domains that require security or transparency. In this work, we evaluate the effectiveness of defenses that differentiably penalize the degree to which small changes in inputs can alter model predictions. Across multiple attacks, architectures, defenses, and datasets, we find that neural networks trained with this input gradient regularization exhibit robustness to transferred adversarial examples generated to fool all of the other models. We also find that adversarial examples generated to fool gradient-regularized models fool all other models equally well, and actually lead to more "legitimate," interpretable misclassifications as rated by people (which we confirm in a human subject experiment). Finally, we demonstrate that regularizing input gradients makes them more naturally interpretable as rationales for model predictions. We conclude by discussing this relationship between interpretability and robustness in deep neural networks.
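Research Motivation and Goals
- Motivate and address two weaknesses of DNNs: their lack of interpretability and their vulnerability to small adversarial perturbations.
- Propose a differentiable regularizer that encourages smoother (smaller-norm) input gradients during training.
- Evaluate the robustness and interpretability of gradient-regularized models under multiple attacks and across datasets.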
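Proposed Method
- Formalize gradient regularization as a penalty on the squared L2 norm of the loss's input gradient: minimize H(y, ŷ) + λ ||∇ₓ H(y, ŷ)||₂², where H is the cross-entropy between labels y and predictions ŷ.
- Compare gradient regularization with defensive distillation and adversarial training under FGSM, TGSM, and JSMA attacks.
- Train convolutional neural networks (CNNs) with Adam and specific hyperparameters on MNIST, SVHN, and notMNIST, and study the effect of λ and of training time.
- Analyze the distributions of input gradients and model confidence to understand robustness and interpretability.
- Run a human subject study to assess how plausible the adversarial misclassifications produced against each defense are.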
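The regularized objective in the first bullet can be illustrated with a minimal sketch. It assumes a linear softmax classifier (logits = W x + b) rather than the CNNs used in the paper, so that the input gradient has a closed form and no autodiff library is needed. The function names `cross_entropy_and_input_grad` and `regularized_loss` are hypothetical and chosen only for this illustration.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_and_input_grad(W, b, x, label):
    """Return H(y, y_hat) and its gradient w.r.t. x for a linear softmax model.

    Assumption: the model is logits = W x + b, which is NOT the paper's CNN;
    it is used here only because its input gradient is analytic.
    """
    logits = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    probs = softmax(logits)
    loss = -math.log(probs[label])
    # dH/dlogits = probs - onehot(label); the chain rule through W x + b
    # gives dH/dx = W^T (probs - onehot(label)).
    delta = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    grad_x = [sum(delta[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]
    return loss, grad_x

def regularized_loss(W, b, x, label, lam):
    """Cross-entropy plus lam times the squared L2 norm of the input gradient."""
    loss, grad_x = cross_entropy_and_input_grad(W, b, x, label)
    penalty = sum(g * g for g in grad_x)
    return loss + lam * penalty

if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    b = [0.0, 0.0]
    print(regularized_loss(W, b, [0.0, 0.0], label=0, lam=0.1))
```

For a deep network the input gradient would typically come from automatic differentiation instead of a closed form, and the penalty would be minimized jointly with the cross-entropy over the model parameters.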
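Experimental Results
Research Questions
- RQ1: Does input gradient regularization improve robustness to adversarial examples, including attacks transferred from other models?
- RQ2: How does gradient regularization affect the interpretability of adversarial perturbations and of the model's explanations?
- RQ3: Can gradient regularization be combined effectively with adversarial training to increase robustness?
- RQ4: How do gradient-regularized models compare with distillation and adversarial training under white-box and black-box attacks?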
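Key Findings
- Gradient-regularized models are robust to transferred FGSM attacks on MNIST, SVHN, and notMNIST, and often outperform the other defenses at higher perturbation levels.
- Attacks crafted against gradient-regularized models tend to fool the other models just as well, which suggests that their robustness and transfer dynamics differ from those of the standard defenses.
- Adversarial examples generated against defensively distilled models often perform poorly or fail to fool other models because of vanishing gradients, whereas gradient regularization remains robust.
- Combining gradient regularization with adversarial training gives the highest robustness on SVHN, with a slight label-leaking effect under FGSM.
- The human subject experiment shows that adversarial examples generated for gradient-regularized models have more plausible targets, which indicates more interpretable adversarial perturbations.
- Visualizations show that gradient regularization produces smoother input gradients that are easier for humans to understand than those of normal or distilled models.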
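The FGSM attack referred to in these findings moves each input by a fixed step ε in the sign direction of the loss's input gradient: x_adv = x + ε · sign(∇ₓ H). The sketch below shows that step under the assumption of a linear softmax classifier (not the paper's CNNs), so that the gradient is analytic. The helper names `loss_and_input_grad` and `fgsm` are hypothetical.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_input_grad(W, b, x, label):
    # Assumed model: logits = W x + b, with cross-entropy loss on `label`.
    logits = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    probs = softmax(logits)
    loss = -math.log(probs[label])
    delta = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    grad_x = [sum(delta[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]
    return loss, grad_x

def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)

def fgsm(W, b, x, label, eps):
    """One FGSM step: move every input coordinate by eps in the direction
    that increases the loss to first order."""
    _, grad_x = loss_and_input_grad(W, b, x, label)
    return [xj + eps * sign(gj) for xj, gj in zip(x, grad_x)]

if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    b = [0.0, 0.0]
    x = [0.0, 0.0]
    x_adv = fgsm(W, b, x, label=0, eps=0.1)
    clean_loss, _ = loss_and_input_grad(W, b, x, 0)
    adv_loss, _ = loss_and_input_grad(W, b, x_adv, 0)
    print(x_adv, clean_loss, adv_loss)
```

To first order, the loss increase from this step is ε times the L1 norm of the input gradient, which gives one intuition for why penalizing input gradients during training could reduce the effect of such perturbations.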
This explanation was generated by AI and reviewed by human editors.