[论文解读] Visual Adversarial Examples Jailbreak Aligned Large Language Models
论文表明,视觉对抗性输入可以越狱对齐护栏,导致在多模态视觉能力的大型语言模型中生成有害内容,超出目标的少-shot 语料,适用于多种模型及黑箱转移场景。
Recently, there has been a surge of interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) such as Flamingo and GPT-4. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. Intriguingly, we discover that a single visual adversarial example can universally jailbreak an aligned LLM, compelling it to heed a wide range of harmful instructions that it otherwise would not) and generate harmful content that transcends the narrow scope of a `few-shot' derogatory corpus initially employed to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. Our findings also connect the long-studied adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend toward multimodality in frontier foundation models.
研究动机与目标
- 强调视觉输入为具备视觉能力的LLM扩展的攻击面;
- 证明单个视觉对抗性示例可普遍越狱对齐的VLMs;
- 显示越狱在多种模型之间及黑箱条件下的可转移性;
- 将神经网络中的对抗性漏洞与多模态模型中的AI对齐挑战联系起来。
提出的方法
- 通过在 x_adv 条件下最小化小型少-shot 有害语料 Y 的负对数似然来形成对抗输入 x_adv(式(1))。
- 利用端到端可微的可视扰动通过PGD在约束(epsilon)或无约束设置下对 x_adv 进行优化。
- 将 x_adv 与有害指令 x_harm 组成联合输入 [x_adv, x_harm] 以触发越狱输出。
- 将可视攻击与文本攻击进行比较,文本攻击使用离散优化(hotflip/Shin 等)以匹配长度的对抗文本标记对比。
- 在具视觉能力的 Vicuna 基模型(MiniGPT-4、InstructBLIP)以及在 LLaVA/LLaMA-2-Chat 上评估攻击,包括转移性分析。
实验结果
研究问题
- RQ1视觉对抗性示例是否能够普遍越狱具视觉能力的LLM 的对齐护栏?
- RQ2在越狱和引发有害输出的有效性方面,视觉攻击与仅文本的对抗攻击有何差异?
- RQ3视觉对抗性越狱是否可在不同VLM之间转移(黑箱设置)?
- RQ4这些视觉对抗性示例所诱导的有害输出范围是否超出用于优化的少-shot 语料?
主要发现
| 场景 | 身份攻击 | 虚假信息 | 暴力/犯罪 | X风险 |
|---|---|---|---|---|
| benign image (no attack) | 26.2 | 48.9 | 50.1 | 20.0 |
| adv.image (eps16) | 61.5 | 58.9 | 80.0 | 50.0 |
| adv.image (eps32) | 70.0 | 74.4 | 87.3 | 73.3 |
| adv.image (eps64) | 77.7 | 84.4 | 81.3 | 53.3 |
| adv.image (unconstrained) | 78.5 | 91.1 | 84.0 | 63.3 |
| adv. text (unconstrained) | 58.5 | 68.9 | 24.0 | 26.7 |
- 单个视觉对抗性示例即可显著增加对齐的VLM在多个类别(身份攻击、虚假信息、暴力/犯罪、X风险)输出有害内容的可能性;
- 在epsilon高达64/255甚至无约束的视觉输入下,在人工评估中对四个类别均实现了较高的越狱成功率;
- 视觉对抗性示例还提升 RealToxicityPrompts 的毒性指标,按 Perspective API 与 Detoxify 的测量,输出具有毒性属性的比例上升;
- 与等长度的文本对抗攻击相比,视觉攻击通常产生更强的越狱效果并更紧凑地降低优化损失;
- 该攻击演示了 MiniGPT-4(Vicuna)、InstructBLIP(Vicuna)与 LLaVA(LLaMA-2-Chat)之间的黑箱转移性;
- 基于 DiffPure 的净化可以缓解视觉对抗输入引起的部分毒性上升。
更好的研究,从现在开始
从论文设计到论文写作,大幅缩短您的研究时间。
无需绑定信用卡
本解读由 AI 生成,并经人工编辑审核。