QUICK REVIEW

[Paper Review] Constructing Unrestricted Adversarial Examples with Generative Models

Yang Song, Rui Shu|arXiv (Cornell University)|2018. 05. 21.

Adversarial Robustness in Machine Learning | References: 49 | Citations: 125

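One-Line Summary

This paper introduces unrestricted adversarial examples synthesized from scratch with a conditional generative model (AC-GAN) and shows that they can bypass certified defenses and adversarial training while still looking legitimate to humans.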

ABSTRACT

Adversarial examples are typically constructed by perturbing an existing data point within a small matrix norm, and current defense methods are focused on guarding against this type of attack. In this paper, we propose unrestricted adversarial examples, a new threat model where the attackers are not restricted to small norm-bounded perturbations. Different from perturbation-based attacks, we propose to synthesize unrestricted adversarial examples entirely from scratch using conditional generative models. Specifically, we first train an Auxiliary Classifier Generative Adversarial Network (AC-GAN) to model the class-conditional distribution over data samples. Then, conditioned on a desired class, we search over the AC-GAN latent space to find images that are likely under the generative model and are misclassified by a target classifier. We demonstrate through human evaluation that unrestricted adversarial examples generated this way are legitimate and belong to the desired class. Our empirical results on the MNIST, SVHN, and CelebA datasets show that unrestricted adversarial examples can bypass strong adversarial training and certified defense methods designed for traditional adversarial attacks.

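Research Motivation and Goals

Motivate a threat model in which adversarial inputs are not limited to small perturbations of existing data.
Propose a practical method for generating unrestricted adversarial examples from scratch using a class-conditional generative model.
Evaluate the attack against certified defenses and adversarially trained models across several datasets.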

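Proposed Method

Train an Auxiliary Classifier GAN (AC-GAN) to model the class-conditional image distribution.
Search the AC-GAN latent space for images that are likely under the generative model and that the target classifier misclassifies.
Generate high-fidelity unrestricted adversarial examples by optimizing the loss L = L0 + λ1 L1 + λ2 L2, where L0 targets the classifier under attack, L1 regularizes the latent code, and L2 uses the auxiliary classifier to keep the image consistent with its original class.
Optionally use a noise-augmented attack, which adds a small amount of learnable noise to the generated image to increase diversity.
Use Amazon Mechanical Turk to verify that the synthesized images belong to the intended class and look legitimate to humans.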
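
The latent-space search described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. The generator and both classifiers are random linear toy stand-ins rather than a trained AC-GAN. The concrete forms chosen for L0 and L2 (negative log-likelihoods) and for L1 (a hinge on the distance from the initial latent code, with a hypothetical margin `eps`) are assumptions made for the example. Finite-difference gradient descent stands in for backpropagation. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def attack_loss(z, z0, y_source, y_target, generator, target_clf, aux_clf,
                lam1=1.0, lam2=1.0, eps=0.1):
    """Composite loss L = L0 + lam1 * L1 + lam2 * L2 (assumed concrete forms)."""
    x = generator(z, y_source)
    # L0: push the attacked classifier toward the (wrong) target label.
    l0 = -np.log(target_clf(x)[y_target] + 1e-12)
    # L1: keep the latent code within a margin eps of its starting point z0.
    l1 = np.mean(np.maximum(np.abs(z - z0) - eps, 0.0))
    # L2: keep the auxiliary classifier confident in the source class.
    l2 = -np.log(aux_clf(x)[y_source] + 1e-12)
    return l0 + lam1 * l1 + lam2 * l2

def numerical_grad(f, z, h=1e-5):
    """Central finite differences; a stand-in for backpropagation."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = h
        grad[i] = (f(z + step) - f(z - step)) / (2 * h)
    return grad

def search_latent(z0, loss_fn, lr=0.05, steps=200):
    """Gradient descent from z0, returning the best latent code seen."""
    z, best_z, best_loss = z0.copy(), z0.copy(), loss_fn(z0)
    for _ in range(steps):
        z = z - lr * numerical_grad(loss_fn, z)
        current = loss_fn(z)
        if current < best_loss:
            best_z, best_loss = z.copy(), current
    return best_z, best_loss

# Toy stand-ins (assumptions): random linear generator and classifiers.
rng = np.random.default_rng(0)
n_classes, z_dim, x_dim = 3, 4, 6
W = rng.normal(size=(x_dim, z_dim))
class_bias = rng.normal(size=(n_classes, x_dim))
A = rng.normal(size=(n_classes, x_dim))
B = rng.normal(size=(n_classes, x_dim))

def generator(z, y):
    return W @ z + class_bias[y]  # class-conditional "image"

def target_clf(x):
    return softmax(A @ x)  # classifier under attack

def aux_clf(x):
    return softmax(B @ x)  # auxiliary classifier

y_source, y_target = 0, 1
z0 = rng.normal(size=z_dim)

def loss_fn(z):
    return attack_loss(z, z0, y_source, y_target,
                       generator, target_clf, aux_clf)

initial_loss = loss_fn(z0)
z_adv, final_loss = search_latent(z0, loss_fn)
print("loss before:", initial_loss, "loss after:", final_loss)
```

In a realistic setting the toy functions would be replaced by a trained AC-GAN generator, its auxiliary classifier, and the classifier under attack, with gradients computed by automatic differentiation. The weights `lam1` and `lam2` trade off attack strength against staying close to the sampled latent code and to the source class.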

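Experimental Results

Research Questions

RQ1: Can unrestricted adversarial examples (images) synthesized from scratch fool classifiers despite strong defenses?
RQ2: How effective are AC-GAN-based unrestricted adversarial examples against certified defenses and adversarial training on MNIST, SVHN, and CelebA?
RQ3: Do unrestricted adversarial examples transfer to other architectures?
RQ4: Does adding noise to the generator's output improve attack effectiveness or realism?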

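Key Results

Unrestricted adversarial examples achieve high success rates (at least 84%) at fooling the target classifiers on the MNIST, SVHN, and CelebA datasets.
The attack bypasses certified defenses designed for perturbation-based attacks and also threatens adversarially trained models.
Human evaluation on MTurk confirms that many of the unrestricted adversarial examples are legitimate images that belong to the desired class.
The noise-augmented variant improves transferability to some models, while its effect on others differs.
Unrestricted adversarial examples show moderate transferability to other architectures, which suggests a potential black-box risk.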

This review was generated by AI and checked by a human editor.