QUICK REVIEW

[Paper Review] Visual Adversarial Examples Jailbreak Aligned Large Language Models

Xiangyu Qi, Kaixuan Huang|arXiv (Cornell University)|2023. 06. 22.

Adversarial Robustness in Machine Learning | Citations: 12

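One-Line Summary

This paper shows that visual adversarial inputs can bypass the alignment guardrails of vision-enabled LLMs, enabling harmful content generation that goes beyond the few-shot corpus used for optimization, and confirms this across several models and in a black-box transfer setting.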

ABSTRACT

Recently, there has been a surge of interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) such as Flamingo and GPT-4. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. Intriguingly, we discover that a single visual adversarial example can universally jailbreak an aligned LLM, compelling it to heed a wide range of harmful instructions (that it otherwise would not) and generate harmful content that transcends the narrow scope of a 'few-shot' derogatory corpus initially employed to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. Our findings also connect the long-studied adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend toward multimodality in frontier foundation models.

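Motivation and Objectives

Emphasizes that visual input creates an expanded attack surface for vision-integrated LLMs.
Demonstrates that a single visual adversarial example can universally jailbreak an aligned VLM.
Shows that the jailbreak transfers across several models and under black-box conditions.
Connects the adversarial vulnerability of neural networks to the AI alignment challenges of multimodal models.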

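Proposed Method

Formulates the adversarial input x_adv as the minimizer of the negative log-likelihood (NLL) of a small few-shot harmful corpus Y conditioned on x_adv (Eqn 1).
Optimizes x_adv with PGD in either a bounded (ε) or an unconstrained setting, relying on the end-to-end differentiability of the visual input pathway.
Pairs x_adv with a harmful instruction x_harm as the joint input [x_adv, x_harm] to elicit jailbroken output.
Compares the visual attack with a text-based attack that optimizes a matched-length sequence of vocabulary tokens via discrete optimization (HotFlip; Shin et al.).
Evaluates the attack and its transferability on vision-integrated Vicuna-based models (MiniGPT-4, InstructBLIP) and on LLaVA (built on LLaMA-2-Chat).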
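
The optimization loop described above can be sketched on a toy problem. This is a minimal illustration of projected gradient descent (PGD) that minimizes a summed NLL under an L-infinity budget; it is not the authors' implementation. The linear-softmax "model" W, the target ids, and the values of eps, alpha, and steps are all assumptions chosen for illustration, whereas the paper backpropagates through a real VLM's visual pathway.

```python
import numpy as np

def nll_and_grad(W, x, targets):
    """Summed -log p(y | x) over target ids for a toy softmax model, and its gradient w.r.t. x."""
    logits = W @ x
    logits = logits - logits.max()                  # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    probs = np.exp(log_probs)
    loss = -log_probs[targets].sum()
    counts = np.bincount(targets, minlength=W.shape[0])
    grad_logits = len(targets) * probs - counts     # d(loss)/d(logits)
    return loss, W.T @ grad_logits

def pgd_minimize_nll(W, x_clean, targets, eps=16 / 255, alpha=1 / 255, steps=200):
    """Signed-gradient descent on the NLL, projected onto the eps-ball and the [0, 1] pixel range."""
    x_adv = x_clean.copy()
    for _ in range(steps):
        _, grad = nll_and_grad(W, x_adv, targets)
        x_adv = x_adv - alpha * np.sign(grad)                  # descend: the loss is minimized
        x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)   # project onto the L-infinity ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                       # keep valid pixel values
    return x_adv

# Hypothetical toy setup: 10-symbol vocabulary, 64-pixel flattened "image".
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 64))
x_clean = rng.uniform(size=64)
targets = np.array([3, 3, 7])    # stand-in for a tokenized few-shot corpus

loss_before, _ = nll_and_grad(W, x_clean, targets)
x_adv = pgd_minimize_nll(W, x_clean, targets)
loss_after, _ = nll_and_grad(W, x_adv, targets)
print(loss_before, loss_after)
```

In this sketch, setting eps to 1.0 or larger leaves only the [0, 1] clip active, which corresponds to the unconstrained setting mentioned above.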

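Experimental Results

Research Questions

RQ1: Do visual adversarial examples enable universal jailbreaking of vision-enabled LLMs?
RQ2: Are visual attacks more effective than text-only adversarial attacks at inducing policy violations and toxicity?
RQ3: Does the visual adversarial jailbreak transfer between different VLMs (in a black-box setting)?
RQ4: How far does the harmful output induced by these visual adversarial examples extend beyond the few-shot corpus used for optimization?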

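Key Findings

A single visual adversarial example substantially increases the likelihood that an aligned VLM outputs harmful content across several categories (identity attacks, disinformation, violence/crime, X-risk).
Attacks with ε up to 64/255, as well as unconstrained attacks, achieve high jailbreak success in all four categories under human evaluation.
Visual adversarial examples also raise toxicity on RealToxicityPrompts, increasing the fraction of outputs flagged with toxicity attributes as measured by Perspective API and Detoxify.
Compared with a text-only adversarial attack of the same token length, the visual attack generally shows a stronger jailbreak effect and a larger reduction in optimization loss.
The attack demonstrates black-box transfer among MiniGPT-4 (Vicuna), InstructBLIP (Vicuna), and LLaVA (LLaMA-2-Chat).
DiffPure-based purification can mitigate part of the toxicity increase caused by visual adversarial inputs.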
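
The toxicity finding above is stated as a fraction of flagged outputs, which can be computed as follows. This is a minimal sketch, assuming per-output attribute scores in [0, 1] have already been obtained from a scorer such as Perspective API or Detoxify. The function name flagged_ratio, the 0.5 threshold, and the score values are hypothetical illustrations, not results from the paper.

```python
from typing import Dict, List

def flagged_ratio(scores: List[Dict[str, float]], attribute: str, threshold: float = 0.5) -> float:
    """Fraction of outputs whose score for `attribute` is at or above `threshold`."""
    if not scores:
        return 0.0
    flagged = sum(1 for s in scores if s.get(attribute, 0.0) >= threshold)
    return flagged / len(scores)

# Made-up scores for four generations each (illustration only).
baseline = [{"toxicity": 0.1}, {"toxicity": 0.7}, {"toxicity": 0.2}, {"toxicity": 0.3}]
attacked = [{"toxicity": 0.6}, {"toxicity": 0.9}, {"toxicity": 0.2}, {"toxicity": 0.8}]

print(flagged_ratio(baseline, "toxicity"))  # 0.25
print(flagged_ratio(attacked, "toxicity"))  # 0.75
```

Comparing the ratio with and without the adversarial image, per attribute, gives the kind of before/after comparison the finding describes.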


This review was generated by AI and checked by a human editor.