QUICK REVIEW

[Paper Review] Interpretable Deep Learning under Fire

Xinyang Zhang, Ningfei Wang|arXiv (Cornell University)|Dec 3, 2018

Adversarial Robustness in Machine Learning | References: 70 | Cited by: 17

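One-Sentence Summary

This paper presents ADV^2, a new class of adversarial attacks that manipulate both the prediction of a deep neural network (DNN) and the output of its coupled interpretation model in an interpretable deep learning system (IDLS). It shows that existing IDLSes are highly vulnerable: the adversary can designate both the prediction and the interpretation of a given input, which undermines the sense of security that interpretability is supposed to provide. The key contributions are identifying the prediction-interpretation gap as one root cause of this vulnerability and proposing countermeasures such as adversarial interpretation distillation (AID).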

ABSTRACT

Providing explanations for deep neural network (DNN) models is crucial for their use in security-sensitive domains. A plethora of interpretation models have been proposed to help users understand the inner workings of DNNs: how does a DNN arrive at a specific decision for a given input? The improved interpretability is believed to offer a sense of security by involving human in the decision-making process. Yet, due to its data-driven nature, the interpretability itself is potentially susceptible to malicious manipulations, about which little is known thus far. Here we bridge this gap by conducting the first systematic study on the security of interpretable deep learning systems (IDLSes). We show that existing IDLSes are highly vulnerable to adversarial manipulations. Specifically, we present ADV^2, a new class of attacks that generate adversarial inputs not only misleading target DNNs but also deceiving their coupled interpretation models. Through empirical evaluation against four major types of IDLSes on benchmark datasets and in security-critical applications (e.g., skin cancer diagnosis), we demonstrate that with ADV^2 the adversary is able to arbitrarily designate an input's prediction and interpretation. Further, with both analytical and empirical evidence, we identify the prediction-interpretation gap as one root cause of this vulnerability -- a DNN and its interpretation model are often misaligned, resulting in the possibility of exploiting both models simultaneously. Finally, we explore potential countermeasures against ADV^2, including leveraging its low transferability and incorporating it in an adversarial training framework. Our findings shed light on designing and operating IDLSes in a more secure and informative fashion, leading to several promising research directions.

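Research Motivation and Objectives

Investigate the security vulnerabilities of interpretable deep learning systems (IDLSes), in which both the DNN classifier and its interpretation model are susceptible to adversarial manipulation.
Close a key gap in understanding: whether interpretability, often regarded as a security-enhancing mechanism, can itself be subverted by adversarial attacks.
Identify the root causes of this vulnerability, in particular the misalignment between a DNN's predictions and the outputs of its interpretation model.
Evaluate how well adversarial inputs transfer across interpretation models, and explore ensemble-based defenses.
Propose and validate an adversarial training framework, adversarial interpretation distillation (AID), to make interpretation models more robust.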

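Proposed Method

Introduce ADV^2, a new class of adversarial attacks that generate inputs misleading both the target DNN and its coupled interpretation model.
Formulate a joint optimization objective that steers both the DNN's predicted class and the interpreter's attribution (saliency) map toward outcomes chosen by the adversary.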
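Empirically evaluate ADV^2 against four major types of interpretation models: back-propagation-guided (e.g., Grad), representation-guided (e.g., CAM), model-guided (e.g., RTS), and perturbation-guided (e.g., MASK).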
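Analyze the prediction-interpretation gap by measuring the statistical and spatial misalignment between DNN predictions and interpretation maps across models and datasets.
Study the transferability of ADV^2 across interpretation models by crafting adversarial inputs against one interpreter and testing them against the others.
Propose adversarial interpretation distillation (AID), an adversarial training framework that incorporates ADV^2-generated samples into interpreter training to improve robustness.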
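
The joint objective described in this list can be read as minimizing l_prd(f(x), c_t) + lam * l_int(g(x), m_t) subject to ||x - x0||_inf <= eps, where f is the classifier, g the interpreter, c_t the target class, and m_t the target attribution map. The following is a minimal sketch of that idea on a toy problem, not the authors' implementation. Everything in it is an assumption made for illustration: the logistic classifier, the gradient-times-input interpreter, the squared-error map loss, the function names, and the hyperparameters `eps`, `alpha`, `lam`, and `steps`. It uses plain Python lists so that it runs without extra dependencies.

```python
import math


def predict(w, b, x):
    """Toy classifier: probability of class 1 under a logistic model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def interpret(w, x):
    """Toy interpreter (assumption): gradient-times-input attribution of the logit."""
    return [wi * xi for wi, xi in zip(w, x)]


def joint_loss(w, b, x, target_label, target_map, lam):
    """Cross-entropy toward target_label plus lam * squared map distance."""
    p = min(max(predict(w, b, x), 1e-12), 1.0 - 1e-12)
    l_prd = -(target_label * math.log(p) + (1 - target_label) * math.log(1.0 - p))
    l_int = sum((m - t) ** 2 for m, t in zip(interpret(w, x), target_map))
    return l_prd + lam * l_int


def joint_grad(w, b, x, target_label, target_map, lam):
    """Analytic gradient of joint_loss with respect to the input x."""
    p = predict(w, b, x)
    m = interpret(w, x)
    return [(p - target_label) * wi + 2.0 * lam * wi * (mi - ti)
            for wi, mi, ti in zip(w, m, target_map)]


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def joint_attack(w, b, x0, target_label, target_map,
                 eps=0.3, alpha=0.02, lam=1.0, steps=100):
    """Projected signed-gradient descent on the joint loss; returns the best iterate."""
    x = list(x0)
    best_x = list(x0)
    best_loss = joint_loss(w, b, x0, target_label, target_map, lam)
    for _ in range(steps):
        g = joint_grad(w, b, x, target_label, target_map, lam)
        # Signed-gradient step that lowers the joint loss.
        x = [xi - alpha * sign(gi) for xi, gi in zip(x, g)]
        # Project back into the L-infinity ball of radius eps around x0.
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]
        loss = joint_loss(w, b, x, target_label, target_map, lam)
        if loss < best_loss:
            best_x, best_loss = list(x), loss
    return best_x


# Hypothetical toy data: flip the prediction to class 1 while keeping the
# attribution map close to the one produced for the benign input.
w, b = [2.0, -1.0, 0.5], 0.0
x0 = [0.2, 0.4, -0.1]
benign_map = interpret(w, x0)
x_adv = joint_attack(w, b, x0, target_label=1, target_map=benign_map)
print(predict(w, b, x0), predict(w, b, x_adv))
```

The weight `lam` controls the trade-off: a larger value keeps the attribution map closer to the target at the cost of a less confident misclassification. For a real DNN with a gradient-based interpreter, the interpretation term would need second-order gradients from an automatic differentiation framework instead of the closed-form gradient used here.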

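Experimental Results

Research Questions

RQ1: Can adversarial inputs be crafted that manipulate both a DNN's prediction and the interpretation produced by its coupled interpreter?
RQ2: What role does the prediction-interpretation gap play in enabling this dual manipulation, and how does its effect vary across interpretation models?
RQ3: How well do ADV^2 adversarial inputs transfer across different types of interpretation models?
RQ4: Does adversarial training with ADV^2 inputs make interpretation models more robust to such attacks?
RQ5: To what extent does the current reliance on interpretability in security-critical applications provide a false sense of security?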

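Key Findings

ADV^2 generates adversarial inputs that mislead both the DNN classifier and its interpretation model, allowing the adversary to designate the prediction and the interpretation of an input.
Empirical evaluation on benchmark datasets (e.g., CIFAR-10, ImageNet) and in real-world applications (e.g., skin cancer diagnosis) shows that ADV^2 achieves high success rates across many combinations of DNNs and interpretation models.
The prediction-interpretation gap, i.e., the interpretation model only partially aligning with the DNN's decision process, is identified as a key weakness that makes dual manipulation possible.
ADV^2 transfers poorly across different types of interpretation models, which suggests that interpreters working from complementary perspectives (e.g., back-propagation vs. input perturbation) are unlikely to be fooled by the same adversarial inputs.
Adversarial interpretation distillation (AID) narrows the prediction-interpretation gap, and ablation experiments indicate that it substantially improves the interpreter's robustness to ADV^2.
The study shows that interpretability alone cannot be fully trusted as a security mechanism in adversarial settings, because an adversary can exploit the misalignment between prediction and interpretation.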
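
Claims such as "the interpreter is deceived" or "the attack transfers poorly" only become testable once there is a similarity score between two attribution maps. A minimal sketch of two such scores follows: the mean L1 distance and the intersection-over-union (IoU) of thresholded maps. The function names, the threshold of 0.5, the 2x2 example maps, and the convention for empty regions are assumptions made here for illustration; they are not taken from the paper's evaluation code.

```python
def l1_distance(map_a, map_b):
    """Mean absolute difference between two equally sized 2-D maps."""
    diffs = [abs(a - b)
             for row_a, row_b in zip(map_a, map_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def salient_cells(attribution_map, threshold):
    """Set of (row, col) positions whose score exceeds the threshold."""
    return {(i, j)
            for i, row in enumerate(attribution_map)
            for j, value in enumerate(row)
            if value > threshold}


def iou(map_a, map_b, threshold=0.5):
    """Intersection-over-union of the thresholded salient regions."""
    a = salient_cells(map_a, threshold)
    b = salient_cells(map_b, threshold)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty regions count as identical
    return len(a & b) / len(union)


# Hypothetical maps: `reference` could be the benign map from one interpreter,
# `candidate` the map another interpreter produces for the adversarial input.
reference = [[0.9, 0.1], [0.8, 0.2]]
candidate = [[0.7, 0.6], [0.1, 0.3]]
print(l1_distance(reference, candidate))
print(iou(reference, candidate))
```

Under this reading, an adversarial input evades an interpreter when the L1 distance to the benign map is small and the IoU is high. Low transferability would show up as a low IoU when the same input is scored with a different interpreter.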


This review was generated by AI and reviewed by a human editor.