QUICK REVIEW

[Paper Review] Visual Adversarial Examples Jailbreak Aligned Large Language Models

Xiangyu Qi, Kaixuan Huang | arXiv (Cornell University) | Jun 22, 2023
Adversarial Robustness in Machine Learning | Cited by 12
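One-Sentence Summary

The paper shows that visual adversarial inputs can jailbreak the alignment guardrails of vision-enabled large language models, causing them to generate harmful content that goes beyond the few-shot corpus used for optimization; the attack applies across multiple models and in black-box transfer settings.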

ABSTRACT

Recently, there has been a surge of interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) such as Flamingo and GPT-4. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. Intriguingly, we discover that a single visual adversarial example can universally jailbreak an aligned LLM, compelling it to heed a wide range of harmful instructions (that it otherwise would not) and generate harmful content that transcends the narrow scope of a 'few-shot' derogatory corpus initially employed to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. Our findings also connect the long-studied adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend toward multimodality in frontier foundation models.

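Motivation and Objectives

  • Highlight that visual input expands the attack surface of vision-enabled LLMs;
  • Demonstrate that a single visual adversarial example can universally jailbreak aligned VLMs;
  • Show that the jailbreak transfers across multiple models and under black-box conditions;
  • Connect adversarial vulnerabilities of neural networks to AI alignment challenges in multimodal models.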

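Proposed Method

  • Construct the adversarial input x_adv by minimizing the negative log-likelihood of a small few-shot harmful corpus Y conditioned on x_adv (Eq. (1)).
  • Because the visual input is end-to-end differentiable, optimize x_adv with PGD under either a constrained (epsilon-bounded) or an unconstrained setting.
  • Combine x_adv with a harmful instruction x_harm into the joint input [x_adv, x_harm] to trigger jailbroken output.
  • Compare the visual attack with a text attack that uses discrete optimization (HotFlip / Shin et al.) over adversarial text tokens of matched length.
  • Evaluate the attack on Vicuna-based vision-enabled models (MiniGPT-4, InstructBLIP) and on LLaVA (LLaMA-2-Chat), including a transferability analysis.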
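
The optimization loop described in the method bullets can be sketched on a toy problem. The sketch below is not the authors' implementation: a small logistic scorer stands in for the VLM's likelihood p(y | x), and the names `nll`, `nll_grad`, `pgd`, the step size `alpha`, and the step count are all illustrative assumptions. The update rule (signed gradient step, projection onto an L-infinity ball of radius `eps`, clipping to the valid pixel range) is the common PGD formulation, assumed here because the summary only states that PGD is used with or without an epsilon constraint.

```python
import math

def nll(x, targets):
    """Toy stand-in for sum over y in Y of -log p(y | x).

    Each target is a (weights, bias) pair, and p(y | x) is modeled as
    sigmoid(w . x + b). A real attack would use the VLM's token likelihoods.
    """
    total = 0.0
    for w, b in targets:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += math.log1p(math.exp(-z))  # equals -log sigmoid(z)
    return total

def nll_grad(x, targets):
    """Analytic gradient of nll with respect to the input x."""
    grad = [0.0] * len(x)
    for w, b in targets:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        coeff = -1.0 / (1.0 + math.exp(z))  # d/dz of -log sigmoid(z)
        for j, wj in enumerate(w):
            grad[j] += coeff * wj
    return grad

def sign(v):
    return (v > 0) - (v < 0)

def pgd(x_benign, targets, eps=None, alpha=1 / 255, steps=200):
    """Minimize nll over the input; eps=None means the unconstrained setting."""
    x_adv = list(x_benign)
    for _ in range(steps):
        g = nll_grad(x_adv, targets)
        # Signed gradient descent step on the loss.
        x_adv = [xi - alpha * sign(gi) for xi, gi in zip(x_adv, g)]
        if eps is not None:
            # Project back onto the L-infinity ball around the benign input.
            x_adv = [min(max(xi, x0 - eps), x0 + eps)
                     for xi, x0 in zip(x_adv, x_benign)]
        # Keep every "pixel" in the valid [0, 1] range.
        x_adv = [min(max(xi, 0.0), 1.0) for xi in x_adv]
    return x_adv

# Hypothetical 4-"pixel" image and a two-item target corpus.
x_benign = [0.5, 0.5, 0.5, 0.5]
targets = [([1.0, -2.0, 0.5, 0.0], 0.0), ([0.5, -1.0, 1.0, -0.5], -0.2)]
x_adv = pgd(x_benign, targets, eps=16 / 255)
print(nll(x_benign, targets), nll(x_adv, targets))
```

In the setting the summary describes, the resulting x_adv would then be paired with a harmful instruction as [x_adv, x_harm]; the toy scorer here has no notion of instructions, so that step is omitted.
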

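Experimental Results

Research Questions

  • RQ1: Can visual adversarial examples universally jailbreak the alignment guardrails of vision-enabled LLMs?
  • RQ2: How do visual attacks differ from text-only adversarial attacks in their effectiveness at jailbreaking and eliciting harmful output?
  • RQ3: Do visual adversarial jailbreaks transfer across different VLMs (black-box setting)?
  • RQ4: Does the range of harmful output induced by these visual adversarial examples extend beyond the few-shot corpus used for optimization?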

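Key Findings

Scenario                    | Identity Attack (%) | Disinformation (%) | Violence/Crime (%) | X-risk (%)
benign image (no attack)    | 26.2                | 48.9               | 50.1               | 20.0
adv. image (eps = 16)       | 61.5                | 58.9               | 80.0               | 50.0
adv. image (eps = 32)       | 70.0                | 74.4               | 87.3               | 73.3
adv. image (eps = 64)       | 77.7                | 84.4               | 81.3               | 53.3
adv. image (unconstrained)  | 78.5                | 91.1               | 84.0               | 63.3
adv. text (unconstrained)   | 58.5                | 68.9               | 24.0               | 26.7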
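
To make the table easier to read, the short sketch below computes how far each attack setting moves every category relative to the benign-image baseline. The numbers are copied from the table; the English category labels and the helper name `gain_over_baseline` are illustrative assumptions, not something defined in the paper.

```python
CATEGORIES = ["Identity Attack", "Disinformation", "Violence/Crime", "X-risk"]

# Rates (%) copied from the table above, one list per scenario.
RATES = {
    "benign image (no attack)": [26.2, 48.9, 50.1, 20.0],
    "adv. image (eps = 16)": [61.5, 58.9, 80.0, 50.0],
    "adv. image (eps = 32)": [70.0, 74.4, 87.3, 73.3],
    "adv. image (eps = 64)": [77.7, 84.4, 81.3, 53.3],
    "adv. image (unconstrained)": [78.5, 91.1, 84.0, 63.3],
    "adv. text (unconstrained)": [58.5, 68.9, 24.0, 26.7],
}

def gain_over_baseline(rates, baseline="benign image (no attack)"):
    """Percentage-point change of each scenario relative to the baseline row."""
    base = rates[baseline]
    return {
        name: [round(v - b, 1) for v, b in zip(values, base)]
        for name, values in rates.items()
        if name != baseline
    }

for name, deltas in gain_over_baseline(RATES).items():
    row = ", ".join(f"{c}: {d:+.1f}" for c, d in zip(CATEGORIES, deltas))
    print(f"{name} -> {row}")
```
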
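  • A single visual adversarial example substantially increases the likelihood that an aligned VLM outputs harmful content across several categories (identity attack, disinformation, violence/crime, X-risk);
  • With epsilon up to 64/255, or with unconstrained visual input, human evaluation shows high jailbreak success rates in all four categories;
  • Visual adversarial examples also raise toxicity metrics on RealToxicityPrompts: as measured by Perspective API and Detoxify, the proportion of outputs exhibiting toxic attributes increases;
  • Compared with text adversarial attacks of equal length, visual attacks generally produce a stronger jailbreak effect and drive the optimization loss down more effectively;
  • The attack demonstrates black-box transferability among MiniGPT-4 (Vicuna), InstructBLIP (Vicuna), and LLaVA (LLaMA-2-Chat);
  • DiffPure-based purification mitigates part of the toxicity increase caused by visual adversarial inputs.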
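
The toxicity finding above is reported as a proportion of outputs that exhibit a given toxic attribute. A minimal sketch of that kind of metric follows, assuming each output has already been scored per attribute by a classifier such as Perspective API or Detoxify. The 0.5 threshold, the attribute names, the sample scores, and the function name `toxic_ratio` are assumptions made for illustration; the summary does not state the threshold used.

```python
def toxic_ratio(scores, attribute, threshold=0.5):
    """Fraction of outputs whose score for `attribute` reaches the threshold.

    `scores` is a list of dicts mapping attribute name -> score in [0, 1],
    one dict per generated output. The threshold is an assumed default.
    """
    if not scores:
        raise ValueError("scores must contain at least one output")
    flagged = sum(1 for s in scores if s[attribute] >= threshold)
    return flagged / len(scores)

# Hypothetical classifier scores for four generated outputs.
benign_scores = [
    {"toxicity": 0.10, "insult": 0.05},
    {"toxicity": 0.60, "insult": 0.20},
    {"toxicity": 0.20, "insult": 0.10},
    {"toxicity": 0.30, "insult": 0.40},
]
adversarial_scores = [
    {"toxicity": 0.90, "insult": 0.70},
    {"toxicity": 0.70, "insult": 0.20},
    {"toxicity": 0.20, "insult": 0.10},
    {"toxicity": 0.80, "insult": 0.60},
]

for attr in ("toxicity", "insult"):
    print(attr, toxic_ratio(benign_scores, attr), toxic_ratio(adversarial_scores, attr))
```

Comparing the ratio for outputs generated with a benign image against the ratio with an adversarial image gives the kind of before-and-after comparison the finding describes.
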


This review was generated by AI and reviewed by human editors.