QUICK REVIEW

[Paper Review] On the (Statistical) Detection of Adversarial Examples

Kathrin Grosse, Praveen Manoharan|arXiv (Cornell University)|2017. 02. 21.

Adversarial Robustness in Machine Learning|References 34|Citations 375

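One-line summary

This paper shows that adversarial examples are statistically different from legitimate data and can be detected with a kernel-based two-sample test, augments the model with an outlier class that provides per-input detection, and evaluates both on MNIST, DREBIN, and MicroRNA.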

ABSTRACT

Machine Learning (ML) models are applied in a variety of tasks such as network intrusion detection or malware classification. Yet, these models are vulnerable to a class of malicious inputs known as adversarial examples. These are slightly perturbed inputs that are classified incorrectly by the ML model. The mitigation of these adversarial inputs remains an open problem. As a step towards understanding adversarial examples, we show that they are not drawn from the same distribution as the original data, and can thus be detected using statistical tests. Using this knowledge, we introduce a complementary approach to identify specific inputs that are adversarial. Specifically, we augment our ML model with an additional output, in which the model is trained to classify all adversarial inputs. We evaluate our approach on multiple adversarial example crafting methods (including the fast gradient sign and saliency map methods) with several datasets. The statistical test flags sample sets containing adversarial inputs confidently at sample sizes between 10 and 100 data points. Furthermore, our augmented model either detects adversarial examples as outliers with high accuracy (> 80%) or increases the adversary's cost - the perturbation added - by more than 150%. In this way, we show that statistical properties of adversarial examples are essential to their detection.

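Motivation and objectives

Demonstrate that adversarial examples are statistically different from the distribution of legitimate training data.
Evaluate statistical tests (MMD-based) for detecting adversarial distributions across multiple datasets and attacks.
Propose an integrated defense that adds an outlier class to the model in order to detect individual adversarial inputs.
Evaluate the robustness of the proposed defense in white-box and black-box attack scenarios.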

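Proposed method

Distinguish the adversarial distribution from the benign distribution with a kernel-based two-sample test (MMD with a Gaussian kernel), using a bootstrapped null distribution.
Measure the Maximum Mean Discrepancy (MMD) and Energy Distance (ED) between samples drawn from the training distribution and adversarially perturbed data.
Evaluate detection performance on the MNIST, DREBIN (Android malware), and MicroRNA datasets under several adversarial crafting methods (FGSM, JSMA, SVM attack, DT attack).
Augment the model with an additional outlier class and train it to assign adversarial inputs to this class, enabling per-input detection at test time.
Compare performance under white-box and black-box threat models, including adaptive attacks.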
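
As a rough illustration of the first step above, the following is a minimal sketch of a Gaussian-kernel MMD two-sample test in NumPy. It is not the authors' implementation: the null distribution is approximated here by permuting the pooled samples (an assumption standing in for the bootstrapped null mentioned above), and the function names, `bandwidth`, and `n_permutations` are hypothetical choices made for this example.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2_biased(x, y, bandwidth):
    """Biased estimate of squared MMD: E k(x,x') + E k(y,y') - 2 E k(x,y)."""
    return (
        gaussian_kernel(x, x, bandwidth).mean()
        + gaussian_kernel(y, y, bandwidth).mean()
        - 2.0 * gaussian_kernel(x, y, bandwidth).mean()
    )


def mmd_permutation_test(x, y, bandwidth=1.0, n_permutations=500, seed=0):
    """Return the observed MMD^2 and a permutation p-value.

    Under the null hypothesis (x and y come from the same distribution),
    shuffling the pooled samples should give statistics similar to the
    observed one.
    """
    rng = np.random.default_rng(seed)
    observed = mmd2_biased(x, y, bandwidth)
    pooled = np.vstack([x, y])
    n = len(x)
    null_stats = np.empty(n_permutations)
    for i in range(n_permutations):
        perm = rng.permutation(len(pooled))
        null_stats[i] = mmd2_biased(pooled[perm[:n]], pooled[perm[n:]], bandwidth)
    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + np.sum(null_stats >= observed)) / (1 + n_permutations)
    return observed, p_value
```

In use, `x` would hold a batch of legitimate inputs (flattened to one row each) and `y` the batch under suspicion; a p-value below the chosen significance level (for example 0.05) rejects the hypothesis that both batches come from the same distribution. The bandwidth matters in practice, and a common heuristic is to set it to the median pairwise distance of the pooled data.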

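Experimental results

Research questions

RQ1: Can a statistical test distinguish adversarial distributions from the model's training distribution even at small sample sizes?
RQ2: Can a model augmented with an outlier class reliably detect adversarial inputs at test time?
RQ3: How robust is the proposed defense against adaptive and black-box adversaries?
RQ4: With the outlier-class approach, what is the effect on the cost of evading detection (increase in perturbation) and on misclassification?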

Key results

Manipulation	Parameter	MMD	ED
Original	-	0.105	130.85
FGSM	ε=0.07	0.281	157.904
FGSM	ε=0.275	0.603	213.967
JSMA	-	0.14	137.63
DT attack	-	0.1	130.71
SVM attack	ε=0.25	0.524	186.32
Flipped	-	0.306	135.0
Subsampling	45 pixel	2.159	102.7
Gaussian Blur	4 pixel	1.021	128.52
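
For readers unfamiliar with the ED column, the sketch below computes the standard energy distance between two sample sets, 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||, in NumPy. This is only an illustration of the textbook definition: the review does not state which normalization or sample sizes produced the values in the table, so the function (whose name is a hypothetical choice) is not expected to reproduce those numbers.

```python
import numpy as np


def mean_pairwise_distance(a, b):
    """Mean Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff**2).sum(axis=2)).mean()


def energy_distance(x, y):
    """Energy distance: 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||.

    Uses plain sample means over all pairs (including identical pairs),
    so the result is zero when x and y are the same array.
    """
    return (
        2.0 * mean_pairwise_distance(x, y)
        - mean_pairwise_distance(x, x)
        - mean_pairwise_distance(y, y)
    )
```

The broadcasting in `mean_pairwise_distance` allocates an array of size `len(a) * len(b) * n_features`, which is fine for the small batches discussed here but would need chunking for large, high-dimensional sample sets.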

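The statistical tests (the kernel-based MMD test, alongside ED) can detect adversarial distributions with as few as 50 input samples in most cases.
The two-sample test rejects the null hypothesis when adversarial inputs are present, while benign samples are correctly identified: the null hypothesis is accepted for them in roughly 95% of cases.
On two datasets, the model augmented with an outlier class detects more than 80% of adversarial examples; where it does not, it raises the perturbation the adversary needs to mislead the model by more than 150%.
The defense remains robust under black-box and adaptive attacks: adversaries are detected with 60% accuracy in the worst case and with more than 90% accuracy in many settings, and adversarial inputs that go undetected usually require larger perturbations.
Detection is less confident for some attack/dataset combinations (such as JSMA on MNIST or the DT attack on MNIST), which is consistent with the weaker statistical divergence observed for these adversaries.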
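
The outlier-class idea behind these results can be sketched on toy data as follows. This is an illustrative approximation, not the paper's setup: it uses a linear softmax classifier trained by plain gradient descent, crafts adversarial inputs with FGSM against an initial two-class model, and then retrains with a third "outlier" label. All function names, the learning rate, the step count, and `eps` are hypothetical choices for this example.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_softmax(x, y, n_classes, lr=0.1, steps=500):
    """Gradient descent on mean cross-entropy for a linear softmax model."""
    w = np.zeros((x.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(steps):
        grad_logits = (softmax(x @ w + b) - onehot) / len(x)
        w -= lr * x.T @ grad_logits
        b -= lr * grad_logits.sum(axis=0)
    return w, b


def fgsm(x, y, w, b, eps):
    """Fast gradient sign method against the linear softmax model (w, b)."""
    onehot = np.eye(w.shape[1])[y]
    grad_x = (softmax(x @ w + b) - onehot) @ w.T  # d(loss)/d(input)
    return x + eps * np.sign(grad_x)


def add_outlier_class(x_clean, y_clean, x_adv, n_classes):
    """Label every adversarial input with the new class index n_classes."""
    x_aug = np.vstack([x_clean, x_adv])
    y_aug = np.concatenate([y_clean, np.full(len(x_adv), n_classes)])
    return x_aug, y_aug


def is_flagged(x, w, b, outlier_index):
    """Per-input detection: True where the model predicts the outlier class."""
    return softmax(x @ w + b).argmax(axis=1) == outlier_index


# Toy data: two well-separated Gaussian clusters in 2D (assumed for the demo).
rng = np.random.default_rng(0)
x_clean = np.vstack([
    rng.normal([-3.0, 0.0], 0.5, (100, 2)),
    rng.normal([3.0, 0.0], 0.5, (100, 2)),
])
y_clean = np.repeat([0, 1], 100)

w, b = train_softmax(x_clean, y_clean, n_classes=2)        # initial model
x_adv = fgsm(x_clean, y_clean, w, b, eps=3.0)               # craft adversarial inputs
x_aug, y_aug = add_outlier_class(x_clean, y_clean, x_adv, n_classes=2)
w_aug, b_aug = train_softmax(x_aug, y_aug, n_classes=3)     # augmented model
flags = is_flagged(x_adv, w_aug, b_aug, outlier_index=2)
print("fraction of adversarial inputs flagged:", flags.mean())
```

The design point is that detection happens inside the classifier itself: no separate detector is consulted at test time, and an input is rejected simply when the arg-max prediction is the extra class. An adaptive adversary then has to find a perturbation that avoids both the correct class and the outlier class, which is one way to read the increased perturbation cost reported above.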


This review was generated by AI and reviewed by a human editor.