QUICK REVIEW

[Paper Review] Interpretable Deep Learning under Fire

Xinyang Zhang, Ningfei Wang|arXiv (Cornell University)|Dec 3, 2018

Adversarial Robustness in Machine Learning | References: 70 | Citations: 17

Brief Summary

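This paper introduces ADV^2, a new adversarial attack framework that simultaneously manipulates the prediction of a deep neural network (DNN) and its coupled interpretation model in interpretable deep learning systems (IDLSes). The study shows that existing IDLSes are highly vulnerable to such attacks: an adversary can arbitrarily control both the model's output and its explanation, which nullifies the security assurance that interpretability is supposed to provide. The main contributions are identifying the "prediction-interpretation gap" as one root cause of this vulnerability and proposing countermeasures such as adversarial interpretation distillation (AID).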

ABSTRACT

Providing explanations for deep neural network (DNN) models is crucial for their use in security-sensitive domains. A plethora of interpretation models have been proposed to help users understand the inner workings of DNNs: how does a DNN arrive at a specific decision for a given input? The improved interpretability is believed to offer a sense of security by involving human in the decision-making process. Yet, due to its data-driven nature, the interpretability itself is potentially susceptible to malicious manipulations, about which little is known thus far. Here we bridge this gap by conducting the first systematic study on the security of interpretable deep learning systems (IDLSes). We show that existing IDLSes are highly vulnerable to adversarial manipulations. Specifically, we present ADV^2, a new class of attacks that generate adversarial inputs not only misleading target DNNs but also deceiving their coupled interpretation models. Through empirical evaluation against four major types of IDLSes on benchmark datasets and in security-critical applications (e.g., skin cancer diagnosis), we demonstrate that with ADV^2 the adversary is able to arbitrarily designate an input's prediction and interpretation. Further, with both analytical and empirical evidence, we identify the prediction-interpretation gap as one root cause of this vulnerability -- a DNN and its interpretation model are often misaligned, resulting in the possibility of exploiting both models simultaneously. Finally, we explore potential countermeasures against ADV^2, including leveraging its low transferability and incorporating it in an adversarial training framework. Our findings shed light on designing and operating IDLSes in a more secure and informative fashion, leading to several promising research directions.

Motivation and Objectives

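Investigate the security vulnerabilities of interpretable deep learning systems (IDLSes), in which both the DNN classifier and its interpretation model are targets of adversarial manipulation.
Close an important gap in understanding: interpretability is regarded as a security enhancement, yet adversarial attacks can subvert that role.
Identify the root cause of IDLS vulnerability, in particular the misalignment between a DNN's predictions and the output of its interpretation model.
Evaluate how well adversarial inputs transfer across different interpretation models, and examine ensemble-based defenses.
Propose and validate adversarial interpretation distillation (AID), an adversarial training framework that improves the robustness of interpretation models.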

Proposed Method

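Propose ADV^2, an adversarial attack that simultaneously misleads a DNN's prediction and its coupled interpretation model.
Design a joint optimization objective that steers both the DNN's predicted class and the interpretation model's attribution map toward the outcomes the adversary wants.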
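Empirically evaluate ADV^2 against four major types of interpretation models: back-propagation-guided (e.g., Grad), representation-guided (e.g., CAM), model-guided (e.g., RTS), and perturbation-guided (e.g., MASK).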
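Analyze the prediction-interpretation gap by measuring statistical and spatial discrepancies between DNN predictions and interpretation maps across models and datasets.
Investigate transferability, that is, whether adversarial inputs generated against one interpretation model remain effective against others.
Propose adversarial interpretation distillation (AID), an adversarial training framework that improves robustness by integrating ADV^2-generated examples into the training stage of the interpretation model.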
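
The joint objective described above can be read as: minimize l_prd(f(x), c_t) + lam * l_int(g(x; f), m_t) subject to ||x - x0||_inf <= eps, where f is the classifier, g the interpreter, c_t the target class, and m_t the target attribution map. The following is a minimal sketch of that idea on a toy model, not the authors' implementation. Everything in it is an assumption made for illustration: the tiny tanh network and its weights, the finite-difference saliency standing in for a back-propagation interpreter, the squared-error interpretation loss, the choice of the benign map as m_t, the hyperparameters, and the helper names (predict, saliency, joint_loss, adv2_toy).

```python
import math

# Toy stand-in for a DNN: 3 inputs -> 4 tanh units -> 3-class softmax.
# Weights are arbitrary illustrative values, not taken from the paper.
W1 = [[1.0, -0.5, 0.3], [-0.7, 0.8, 0.2], [0.4, 0.1, -0.9], [0.6, -0.3, 0.5]]
W2 = [[0.9, -0.6, 0.4, 0.2], [-0.5, 0.7, -0.3, 0.8], [0.1, 0.3, 0.6, -0.7]]


def predict(x):
    """Class probabilities f(x) of the toy classifier."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shifted(x, i, delta):
    """Copy of x with coordinate i moved by delta."""
    return x[:i] + [x[i] + delta] + x[i + 1:]


def saliency(x, cls, h=1e-4):
    """Gradient-style attribution |d f_cls / d x_i| via central differences."""
    return [abs(predict(shifted(x, i, h))[cls] - predict(shifted(x, i, -h))[cls]) / (2 * h)
            for i in range(len(x))]


def joint_loss(x, target_cls, target_map, lam):
    """Prediction loss (cross-entropy) plus lam times interpretation loss."""
    pred_loss = -math.log(predict(x)[target_cls])
    int_loss = sum((s - t) ** 2 for s, t in zip(saliency(x, target_cls), target_map))
    return pred_loss + lam * int_loss


def adv2_toy(x0, target_cls, target_map, eps=0.3, step=0.02, iters=100, lam=5.0):
    """Signed-gradient descent on the joint loss inside an L-inf ball around x0."""
    x = list(x0)
    best_x, best_loss = list(x0), joint_loss(x0, target_cls, target_map, lam)
    for _ in range(iters):
        # Numerical gradient of the joint loss (a real attack would backpropagate).
        grad = [(joint_loss(shifted(x, i, 1e-3), target_cls, target_map, lam)
                 - joint_loss(shifted(x, i, -1e-3), target_cls, target_map, lam)) / 2e-3
                for i in range(len(x))]
        # Step against the gradient sign, then project back into the eps-ball.
        x = [min(max(xi - step * ((g > 0) - (g < 0)), c - eps), c + eps)
             for xi, g, c in zip(x, grad, x0)]
        loss = joint_loss(x, target_cls, target_map, lam)
        if loss < best_loss:  # keep the best iterate seen so far
            best_x, best_loss = list(x), loss
    return best_x, best_loss


x0 = [0.5, -0.2, 0.1]
probs = predict(x0)
orig_cls = probs.index(max(probs))
target_cls = (orig_cls + 1) % 3        # arbitrary adversarial target class
benign_map = saliency(x0, orig_cls)    # explanation the adversary tries to preserve
initial_loss = joint_loss(x0, target_cls, benign_map, lam=5.0)
x_adv, final_loss = adv2_toy(x0, target_cls, benign_map)
adv_probs = predict(x_adv)
print("prediction:", orig_cls, "->", adv_probs.index(max(adv_probs)))
print("joint loss:", initial_loss, "->", final_loss)
```

The weight lam trades off the two goals: with lam = 0 the loop reduces to an ordinary targeted attack on the prediction alone, while larger values push the adversarial input's attribution map toward the target map at the cost of a weaker push on the class.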

Experimental Results

Research Questions

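RQ1: Can adversarial inputs be crafted to simultaneously manipulate a DNN's prediction and the explanation produced by its coupled interpretation model?
RQ2: Does the prediction-interpretation gap enable this dual manipulation, and how does its effect vary across interpretation models?
RQ3: How well do adversarial inputs generated by ADV^2 transfer across different types of interpretation models?
RQ4: Does adversarial training with ADV^2-generated inputs improve the robustness of interpretation models?
RQ5: To what extent does the current reliance on interpretability in security-critical applications create a false sense of security?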

Key Findings

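ADV^2 effectively generates adversarial inputs that mislead a DNN classifier and its interpretation model at the same time, confirming that an adversary can arbitrarily control both the prediction and the explanation.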
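In empirical evaluations on benchmark datasets (e.g., ImageNet) and a real-world application (skin cancer diagnosis), ADV^2 achieved high success rates across a variety of DNN and interpretation model combinations.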
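The prediction-interpretation gap, meaning that the interpretation model does not fully align with the DNN's decision-making, was identified as a main source of the vulnerability that makes dual manipulation possible.
ADV^2 shows low transferability across different types of interpretation models: interpreters that view the DNN from different perspectives, such as back-propagation and input perturbation, tend not to be fooled by the same adversarial input.
Adversarial interpretation distillation (AID) effectively narrows the prediction-interpretation gap, and an ablation study shows that it improves the interpretation model's robustness against ADV^2.
The study shows that, in adversarial settings, interpretability by itself cannot be trusted as a security mechanism, because an adversary can exploit the misalignment between prediction and explanation.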
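
The low-transferability finding suggests a simple consistency check: if several interpreters explain the same input and their attribution maps disagree sharply, the input deserves a second look. The sketch below shows one way to quantify that disagreement with an L1 distance and an intersection-over-union (IoU) of thresholded maps. The function names, the 0.5 thresholds, and the 3x3 maps are hypothetical values chosen for illustration, not data or code from the paper.

```python
from itertools import combinations


def flatten(attr_map):
    """Flatten a 2-D attribution map (list of rows) into one list."""
    return [value for row in attr_map for value in row]


def l1_distance(map_a, map_b):
    """Mean absolute difference between two same-shaped attribution maps."""
    a, b = flatten(map_a), flatten(map_b)
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def iou(map_a, map_b, threshold=0.5):
    """IoU of the regions where each map exceeds the threshold."""
    inter = union = 0
    for p, q in zip(flatten(map_a), flatten(map_b)):
        hot_a, hot_b = p > threshold, q > threshold
        inter += int(hot_a and hot_b)
        union += int(hot_a or hot_b)
    return inter / union if union else 1.0  # two empty regions count as agreeing


def interpreters_disagree(maps, min_iou=0.5):
    """True if any pair of interpreter maps overlaps less than min_iou."""
    return any(iou(a, b) < min_iou for a, b in combinations(maps, 2))


# Hypothetical maps for one input from three interpreters.
map_a = [[0.9, 0.8, 0.1], [0.7, 0.6, 0.0], [0.1, 0.0, 0.0]]
map_b = [[0.8, 0.9, 0.2], [0.6, 0.7, 0.1], [0.0, 0.1, 0.0]]
# A third interpreter that the attack was not optimized against highlights a different region.
map_c = [[0.1, 0.0, 0.0], [0.0, 0.6, 0.9], [0.1, 0.8, 0.7]]

print(iou(map_a, map_b))                             # same highlighted region
print(iou(map_a, map_c))                             # one shared cell out of seven
print(interpreters_disagree([map_a, map_b, map_c]))  # flags the input
```

Thresholded IoU captures where the maps place their emphasis, while the L1 distance is sensitive to magnitude differences everywhere, so the two can disagree. A practical detector would need thresholds calibrated on benign inputs, since different interpreters also disagree to some degree on clean data.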


This review was generated by AI and checked by a human editor.