QUICK REVIEW

[Paper Review] Interpretation of Neural Networks is Fragile

Amirata Ghorbani, Abubakar Abid | arXiv (Cornell University) | October 29, 2017

Explainable Artificial Intelligence (XAI) | References: 24 | Citations: 77

One-Line Summary

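The paper shows that neural network interpretations (saliency maps and exemplar-based explanations) can change drastically, across several interpretation methods and datasets, under small, perceptually indistinguishable input perturbations that leave the predicted label unchanged.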

ABSTRACT

In order for machine learning to be deployed and trusted in many applications, it is crucial to be able to reliably explain why the machine learning algorithm makes certain predictions. For example, if an algorithm classifies a given pathology image to be a malignant tumor, then the doctor may need to know which parts of the image led the algorithm to this classification. How to interpret black-box predictors is thus an important and active area of research. A fundamental question is: how much can we trust the interpretation itself? In this paper, we show that interpretation of deep learning predictions is extremely fragile in the following sense: two perceptively indistinguishable inputs with the same predicted label can be assigned very different interpretations. We systematically characterize the fragility of several widely-used feature-importance interpretation methods (saliency maps, relevance propagation, and DeepLIFT) on ImageNet and CIFAR-10. Our experiments show that even small random perturbation can change the feature importance and new systematic perturbations can lead to dramatically different interpretations without changing the label. We extend these results to show that interpretations based on exemplars (e.g. influence functions) are similarly fragile. Our analysis of the geometry of the Hessian matrix gives insight on why fragility could be a fundamental challenge to the current interpretation approaches.

Motivation and Objectives

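Assess how far model interpretations can be trusted by quantifying their robustness.
Introduce adversarial perturbations that change the interpretation while preserving the prediction.
Systematically evaluate the robustness of feature-importance and exemplar-based interpretations on ImageNet and CIFAR-10.
Provide theoretical and empirical insight into why interpretation fragility arises in high-dimensional nonlinear models.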

Proposed Method

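Define adversarial perturbations that maximize the dissimilarity between the original and perturbed interpretations while the prediction is held fixed.
Attack feature-importance methods (simple gradient, DeepLIFT, integrated gradients) using three strategies (top-k, mass-center, targeted) and an iterative optimization procedure.
Attack influence functions (explanations based on training exemplars) using a gradient sign method.
Evaluate the attacks on ImageNet (SqueezeNet) and CIFAR-10 (a custom CNN), measuring robustness with top-1000 intersection and Spearman rank correlation.
Use a Hessian-based analysis to explain why high dimensionality and nonlinearity promote interpretation fragility.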
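The iterative top-k attack and the two robustness metrics listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `ToyNet`, its softplus architecture, and the values of `k`, `eps`, `alpha`, and `steps` are hypothetical choices made so that the input gradient and the Hessian have closed forms. The loop takes signed steps that lower the saliency mass on the originally most salient features, projects back into an L-infinity ball, and keeps only iterates whose predicted label is unchanged.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ToyNet:
    """Hypothetical model f(x) = w2 . softplus(W1 x); the label is sign(f(x))."""
    def __init__(self, dim, hidden, rng):
        self.W1 = rng.normal(size=(hidden, dim)) / np.sqrt(dim)
        self.w2 = rng.normal(size=hidden)

    def score(self, x):
        return float(self.w2 @ softplus(self.W1 @ x))

    def grad(self, x):
        # Input gradient of the score (softplus' = sigmoid).
        return self.W1.T @ (self.w2 * sigmoid(self.W1 @ x))

    def hessian(self, x):
        # Input Hessian of the score (sigmoid' = s * (1 - s)); symmetric.
        s = sigmoid(self.W1 @ x)
        return self.W1.T @ ((self.w2 * s * (1.0 - s))[:, None] * self.W1)

def saliency(net, x):
    # Simple-gradient feature importance: absolute input gradient.
    return np.abs(net.grad(x))

def topk_attack(net, x, k=10, eps=0.1, alpha=0.01, steps=50):
    """Lower the saliency mass on the original top-k features, label fixed."""
    base_label = np.sign(net.score(x))
    top = np.argsort(saliency(net, x))[-k:]
    best_x, best_mass = x.copy(), saliency(net, x)[top].sum()
    x_adv = x.copy()
    for _ in range(steps):
        g = net.grad(x_adv)
        # d/dx sum_{i in top} |g_i| = H[:, top] @ sign(g[top]) (H symmetric).
        d_mass = net.hessian(x_adv)[:, top] @ np.sign(g[top])
        x_adv = x_adv - alpha * np.sign(d_mass)      # signed descent step
        x_adv = x + np.clip(x_adv - x, -eps, eps)    # project to L-inf ball
        mass = saliency(net, x_adv)[top].sum()
        # Accept only iterates that keep the label and improve the objective.
        if np.sign(net.score(x_adv)) == base_label and mass < best_mass:
            best_x, best_mass = x_adv.copy(), mass
    return best_x

def topk_intersection(a, b, k):
    # Fraction of the k most important features shared by two maps.
    return len(set(np.argsort(a)[-k:]) & set(np.argsort(b)[-k:])) / k

def spearman(a, b):
    # Spearman rank correlation, assuming no ties in either map.
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    net = ToyNet(dim=50, hidden=20, rng=rng)
    x = rng.normal(size=50)
    x_adv = topk_attack(net, x)
    s0, s1 = saliency(net, x), saliency(net, x_adv)
    print("top-10 intersection:", topk_intersection(s0, s1, 10))
    print("Spearman rank correlation:", spearman(s0, s1))
```

A smooth activation is used here because the attack differentiates the saliency map itself, which needs non-zero second derivatives. The mass-center and targeted strategies would replace only the objective inside the loop.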

Experimental Results

Research Questions

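RQ1: Can small input perturbations that do not change the model's prediction meaningfully alter interpretations such as saliency maps or influence functions?
RQ2: Which interpretation methods are most vulnerable to adversarial perturbations?
RQ3: How does the robustness of an interpretation relate to the Hessian geometry of the model?
RQ4: Do the perturbations generalize across datasets and architectures such as CIFAR-10 and ImageNet while leaving the prediction unchanged?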

Key Findings

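Feature-importance maps (simple gradient, DeepLIFT, integrated gradients) can be changed substantially by perceptually indistinguishable perturbations that preserve the original label.
The top-k and mass-center attacks are similarly effective at degrading top-1000 intersection and rank correlation for all three feature-importance methods, and both outperform random sign perturbations.
Integrated gradients is relatively more robust to adversarial interpretation attacks than simple gradient or DeepLIFT.
Influence-function explanations are also highly sensitive to perturbations: under the gradient sign attack, the most influential training examples change substantially.
The attacks can shift interpretations semantically (for example, moving saliency to unimportant regions, or swapping in semantically unrelated training examples) without changing the prediction.
The Hessian-based analysis suggests that high dimensionality and nonlinearity underlie interpretation fragility, and that the directions that perturb the interpretation can be orthogonal to those that perturb the prediction.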
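The orthogonality point in the last finding can be made concrete with a first-order expansion. This is a sketch under assumed notation rather than the paper's exact derivation: $S(x)$ denotes the class score, $\delta$ a small input perturbation, and $H$ the input Hessian of $S$.

```latex
\begin{aligned}
S(x+\delta) - S(x) &\approx \nabla_x S(x)^{\top}\delta, \\
\nabla_x S(x+\delta) - \nabla_x S(x) &\approx H\,\delta,
\qquad H = \nabla_x^{2} S(x).
\end{aligned}
```

Choosing $\delta$ with $\nabla_x S(x)^{\top}\delta = 0$ leaves the score unchanged to first order, while the gradient-based saliency still moves by roughly $H\delta$, which can be large when $\delta$ has components along eigenvectors of $H$ with large-magnitude eigenvalues. In a $d$-dimensional input space the directions orthogonal to the gradient form a $(d-1)$-dimensional subspace, so such perturbations are plentiful when $d$ is large. For a linear model $H = 0$, and the gradient map does not change at all.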


This review was generated by AI and reviewed by a human editor.