QUICK REVIEW

[Paper Review] Ensemble Adversarial Training: Attacks and Defenses

Florian Tramèr, Alexey Kurakin|arXiv (Cornell University)|2017. 05. 19.

Adversarial Robustness in Machine Learning|References: 48|Citations: 1,107

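One-Line Summary

The paper analyzes why single-step adversarial training fails through gradient masking, and introduces Ensemble Adversarial Training, which improves black-box robustness by training on adversarial examples crafted on static pre-trained models.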

ABSTRACT

Adversarial examples are perturbed inputs designed to fool machine learning models. Adversarial training injects such examples into training data to increase robustness. To scale this technique to large datasets, perturbations are crafted using fast single-step methods that maximize a linear approximation of the model's loss. We show that this form of adversarial training converges to a degenerate global minimum, wherein small curvature artifacts near the data points obfuscate a linear approximation of the loss. The model thus learns to generate weak perturbations, rather than defend against strong ones. As a result, we find that adversarial training remains vulnerable to black-box attacks, where we transfer perturbations computed on undefended models, as well as to a powerful novel single-step attack that escapes the non-smooth vicinity of the input data via a small random step. We further introduce Ensemble Adversarial Training, a technique that augments training data with perturbations transferred from other models. On ImageNet, Ensemble Adversarial Training yields models with strong robustness to black-box attacks. In particular, our most robust model won the first round of the NIPS 2017 competition on Defenses against Adversarial Attacks. However, subsequent work found that more elaborate black-box attacks could significantly enhance transferability and reduce the accuracy of our models.

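Research Motivation and Goals

Explain why single-step adversarial training converges to a degenerate minimum and stays vulnerable to black-box attacks.
Propose Ensemble Adversarial Training to diversify the adversarial perturbations seen during training.
Demonstrate the robustness gains on ImageNet and analyze how attacks transfer between models.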

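Proposed Method

Formulates adversarial training with L_infinity-bounded perturbations.
Demonstrates gradient masking and the degenerate minimum that arise under single-step attacks.
R+FGSM: prepends a small random perturbation step to the single-step attack.
Proposes Ensemble Adversarial Training, which augments the training data with adversarial examples crafted on static pre-trained models.
Evaluates Inception v3 and Inception ResNet v2 on ImageNet against a range of white-box and black-box attacks.
Discusses convergence and the trade-off between white-box and black-box robustness.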
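
To make the single-step attacks in the method list concrete, here is a minimal NumPy sketch of FGSM and R+FGSM on a toy logistic-regression model. The toy model, the helper names (`loss_grad_wrt_input`, `fgsm`, `rand_fgsm`), and the split of the budget into a random step `alpha` and a gradient step `eps - alpha` are illustrative assumptions, not the authors' code; the review itself only describes R+FGSM as a random step followed by FGSM.

```python
import numpy as np

def loss_grad_wrt_input(w, b, x, y):
    """Gradient of the logistic loss w.r.t. the input x for a toy linear model.

    y is a label in {0, 1}; the model predicts sigmoid(w @ x + b).
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return (p - y) * w

def fgsm(w, b, x, y, eps):
    """Single-step attack: move eps along the sign of the input gradient."""
    return x + eps * np.sign(loss_grad_wrt_input(w, b, x, y))

def rand_fgsm(w, b, x, y, eps, alpha, rng):
    """R+FGSM sketch (assumed step sizes): a random sign step of size alpha,
    then an FGSM step of size eps - alpha from the shifted point."""
    x_rand = x + alpha * np.sign(rng.standard_normal(x.shape))
    grad = loss_grad_wrt_input(w, b, x_rand, y)
    return x_rand + (eps - alpha) * np.sign(grad)

# Hypothetical toy inputs, only to show the call pattern.
rng = np.random.default_rng(0)
w, b = np.array([1.0, -2.0, 0.5]), 0.1
x, y = np.array([0.2, 0.4, 0.6]), 1
x_fgsm = fgsm(w, b, x, y, eps=0.1)
x_rfgsm = rand_fgsm(w, b, x, y, eps=0.1, alpha=0.05, rng=rng)
```

Both outputs stay inside the L_infinity ball of radius `eps` around `x`, because the random step and the gradient step together move each coordinate by at most `alpha + (eps - alpha)`. The random step matters when the loss surface right at `x` is distorted: the gradient is then taken at a nearby point rather than at `x` itself.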

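Experimental Results

Research Questions

RQ1: Can single-step adversarial training produce a degenerate minimum that obscures the true loss landscape?
RQ2: Does transferring adversarial perturbations from static models increase robustness to black-box attacks?
RQ3: How does Ensemble Adversarial Training affect robustness to different attack types on a large-scale dataset?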

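Key Findings

Single-step adversarial training exhibits gradient masking: small curvature artifacts near the data points make the linear approximation of the loss unreliable, so single-step attacks on the model become weak.
Models adversarially trained with single-step methods appear robust to white-box single-step attacks, yet remain vulnerable to black-box attacks that transfer perturbations computed on undefended models.
The new R+FGSM attack (a small random step followed by FGSM) escapes the non-smooth vicinity of the input and is a stronger single-step attack against adversarially trained models.
Ensemble Adversarial Training (training on perturbations from static pre-trained models) improves robustness to black-box attacks on ImageNet.
Ensemble-trained models are less affected by transferred adversarial perturbations, though white-box robustness may be compromised.
The most robust ensemble model (IRv2_adv-ens) won the first round of the NIPS 2017 competition on Defenses against Adversarial Attacks and showed notable black-box robustness at the time; subsequent work found that more elaborate black-box attacks reduce its accuracy.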
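
The training scheme behind these findings can be sketched on a toy problem. The code below is a hedged illustration, not the paper's ImageNet setup: it trains a logistic-regression model with NumPy, and at each step draws the source of the FGSM examples from either the model being trained or one of several frozen "pre-trained" models. The function names, the uniform choice of source, the clean/adversarial 50/50 mix, and all hyperparameters are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_batch(w, b, X, y, eps):
    """FGSM against a toy logistic model (w, b) for a batch X with labels y."""
    grad = (sigmoid(X @ w + b) - y)[:, None] * w[None, :]
    return X + eps * np.sign(grad)

def ensemble_adv_train(X, y, static_models, eps=0.1, lr=0.5, steps=200, seed=0):
    """Train a logistic model on clean plus adversarial examples whose
    source is drawn per step from the current model or a static one."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        # Index 0 means "attack the model being trained"; any other index
        # picks one of the frozen, pre-trained source models.
        k = rng.integers(0, len(static_models) + 1)
        src_w, src_b = (w, b) if k == 0 else static_models[k - 1]
        X_adv = fgsm_batch(src_w, src_b, X, y, eps)
        # One gradient step on the mixed batch (clean + adversarial).
        X_mix = np.concatenate([X, X_adv])
        y_mix = np.concatenate([y, y])
        err = sigmoid(X_mix @ w + b) - y_mix
        w = w - lr * (X_mix.T @ err) / len(y_mix)
        b = b - lr * err.mean()
    return w, b

# Hypothetical toy data and static source models, for illustration only.
data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
static_models = [(np.array([1.0, 1.0]), 0.0), (np.array([2.0, -0.5]), 0.1)]
w, b = ensemble_adv_train(X, y, static_models)
clean_acc = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1.0))
```

Because the static models never change, the perturbations they supply do not weaken as the trained model adapts, which is the intuition for decoupling example generation from the model under training.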

This review was generated by AI and checked by a human editor.