QUICK REVIEW

[Paper Review] Fooling Neural Network Interpretations via Adversarial Model Manipulation

Juyeon Heo, Sunghwan Joo|arXiv (Cornell University)|Feb 6, 2019

Adversarial Robustness in Machine Learning|References: 35|Citations: 73

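One-line summary

This paper shows that state-of-the-art saliency-map-based interpreters (LRP, Grad-CAM, SimpleGrad) can be fooled by fine-tuning a pretrained model: Passive and Active fooling alter the explanations without hurting accuracy, and the fooling transfers across interpretation methods.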

ABSTRACT

We ask whether the neural network interpretation methods can be fooled via adversarial model manipulation, which is defined as a model fine-tuning step that aims to radically alter the explanations without hurting the accuracy of the original models, e.g., VGG19, ResNet50, and DenseNet121. By incorporating the interpretation results directly in the penalty term of the objective function for fine-tuning, we show that the state-of-the-art saliency map based interpreters, e.g., LRP, Grad-CAM, and SimpleGrad, can be easily fooled with our model manipulation. We propose two types of fooling, Passive and Active, and demonstrate such foolings generalize well to the entire validation set as well as transfer to other interpretation methods. Our results are validated by both visually showing the fooled explanations and reporting quantitative metrics that measure the deviations from the original explanations. We claim that the stability of neural network interpretation method with respect to our adversarial model manipulation is an important criterion to check for developing robust and reliable neural network interpretation method.
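
The fine-tuning objective described in the abstract can be written compactly. The following is only a sketch: the symbols ($\mathcal{L}_C$, $\mathcal{L}_F^{\mathcal{I}}$, $\lambda$, $\mathcal{D}_{\text{fool}}$, $w$, $w_0$) are notation assumed here for illustration and may not match the paper's exactly.

```latex
\mathcal{L}(w) \;=\; \mathcal{L}_C(\mathcal{D};\, w)
\;+\; \lambda \, \mathcal{L}_F^{\mathcal{I}}(\mathcal{D}_{\text{fool}};\, w,\, w_0)
```

Here $\mathcal{L}_C$ is the ordinary classification loss on the training data $\mathcal{D}$, $\mathcal{L}_F^{\mathcal{I}}$ is a penalty computed from the heatmaps that interpreter $\mathcal{I}$ produces on the fooling data $\mathcal{D}_{\text{fool}}$, $w_0$ denotes the pretrained weights, $w$ the fine-tuned weights, and $\lambda$ trades off accuracy against how strongly the explanations are altered.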

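Motivation and objectives

Evaluate the stability of neural network interpretation methods under adversarial model manipulation.
Show that popular saliency-map-based interpreters can be fooled on standard architectures (VGG19, ResNet50, DenseNet121).
Characterize the Passive and Active fooling schemes and their transferability across interpretation methods.
Evaluate robustness to fooling on large-scale data (ImageNet) and discuss the implications for the reliability of explanations.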

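Proposed method

Fine-tune a pretrained model with an objective function that combines the classification loss with an interpretation-based penalty term.
Define Passive fooling (Location, Top-k, Center-mass), which makes the interpreter produce uninformative explanations.
Define Active fooling, which uses a dedicated fooling dataset to swap the explanations between two target classes.
Use three interpreters (LRP-Composite, Grad-CAM, SimpleGrad) to generate the interpretation heatmaps.
Evaluate the Fooling Success Rate (FSR) on validation data split from ImageNet, using a predefined threshold for each fooling type.
Run a robustness evaluation based on AOPC and an adversarial training experiment.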
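
The penalty terms and the success-rate metric listed above can be illustrated with a small pure-Python sketch. It is one plausible reading of the names Location, Top-k, Center-mass, Active fooling, and FSR, not the authors' implementation: every function name, the normalization, and the thresholding rule are assumptions made for illustration, and heatmaps are plain nested lists rather than outputs of a real interpreter.

```python
# Hypothetical helpers; heatmaps are H x W nested lists of floats.

def location_penalty(heatmap, mask):
    """Mean squared difference between a heatmap and a fixed target mask."""
    h, w = len(heatmap), len(heatmap[0])
    total = sum((heatmap[r][c] - mask[r][c]) ** 2
                for r in range(h) for c in range(w))
    return total / (h * w)

def top_k_pixels(heatmap, k):
    """Coordinates of the k largest heatmap values."""
    coords = [(r, c) for r in range(len(heatmap))
              for c in range(len(heatmap[0]))]
    coords.sort(key=lambda rc: heatmap[rc[0]][rc[1]], reverse=True)
    return coords[:k]

def top_k_penalty(heatmap, original_heatmap, k):
    """Heatmap magnitude remaining at the original model's top-k pixels."""
    return sum(abs(heatmap[r][c]) for r, c in top_k_pixels(original_heatmap, k))

def center_of_mass(heatmap):
    """(row, col) center of mass; falls back to the grid center if empty."""
    h, w = len(heatmap), len(heatmap[0])
    total = sum(sum(row) for row in heatmap)
    if total == 0:
        return (h - 1) / 2, (w - 1) / 2
    row_c = sum(r * v for r, row in enumerate(heatmap) for v in row) / total
    col_c = sum(c * v for row in heatmap for c, v in enumerate(row)) / total
    return row_c, col_c

def center_mass_penalty(heatmap, original_heatmap):
    """Negative L1 distance between centers; minimizing pushes them apart."""
    r1, c1 = center_of_mass(heatmap)
    r0, c0 = center_of_mass(original_heatmap)
    return -(abs(r1 - r0) + abs(c1 - c0))

def active_swap_penalty(h_c1, h_c2, h0_c1, h0_c2):
    """Small when each class's new heatmap matches the other's original one."""
    return 0.5 * (location_penalty(h_c1, h0_c2) + location_penalty(h_c2, h0_c1))

def total_objective(classification_loss, fooling_penalty, lam):
    """Classification loss plus a weighted interpretation penalty."""
    return classification_loss + lam * fooling_penalty

def fooling_success_rate(penalties, threshold):
    """Fraction of samples whose penalty is at or below the threshold."""
    return sum(p <= threshold for p in penalties) / len(penalties)

# Toy 2x2 example: the original explanation sits bottom-right,
# the fooled one has been moved to the top-left.
original = [[0.0, 0.0], [0.0, 1.0]]
fooled = [[1.0, 0.0], [0.0, 0.0]]
mask = [[1.0, 0.0], [0.0, 0.0]]
print(location_penalty(fooled, mask), top_k_penalty(fooled, original, 1))
```

In an actual fine-tuning run these penalties would have to be differentiable functions of the model weights, computed from the interpreter's heatmaps inside an autodiff framework; the list-based version here only shows what each term measures.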

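Experimental results

Research questions

RQ1: After targeted fine-tuning of the model, can interpretation methods such as LRP, Grad-CAM, and SimpleGrad still reliably reflect the model's rationale?
RQ2: Does Passive fooling (Location, Top-k, Center-mass) degrade interpretability without materially affecting accuracy?
RQ3: Can Active fooling swap explanations between classes, and does this manipulation transfer across interpreters?
RQ4: How does fooling behavior differ across architectures (VGG19, ResNet50, DenseNet121) and under defensive training such as adversarial training?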

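Key findings

Interpretation methods are vulnerable to adversarial model manipulation, and the manipulation costs only a minimal accuracy drop (about 2% Top-1, about 1% Top-5).
Passive fooling consistently misleads explanations across the entire validation set and transfers across interpreters (LRP_T, Grad-CAM, SimpleG_T).
Active fooling can swap explanations between two classes, with success varying by architecture (high for VGG19 and ResNet50, limited for DenseNet121).
The fooling transfers across interpreters and to other architectures, suggesting a systematic stability problem in saliency-map-based explanations.
The manipulation is hard to detect or undo with simple perturbations or adversarial training, and it can persist under Gaussian perturbation.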

This review was generated by AI and checked by a human editor.