QUICK REVIEW

[Paper Review] Visual Adversarial Examples Jailbreak Aligned Large Language Models

Xiangyu Qi, Kaixuan Huang|arXiv (Cornell University)|Jun 22, 2023

Adversarial Robustness in Machine Learning|Cited by 12

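One-Sentence Summary

The paper shows that visual adversarial inputs can jailbreak the alignment guardrails of vision-enabled large language models, eliciting harmful content that goes beyond the few-shot corpus used for optimization, and that the attack applies across multiple models, including in black-box transfer settings.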

ABSTRACT

Recently, there has been a surge of interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) such as Flamingo and GPT-4. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. Intriguingly, we discover that a single visual adversarial example can universally jailbreak an aligned LLM, compelling it to heed a wide range of harmful instructions (that it otherwise would not) and generate harmful content that transcends the narrow scope of a 'few-shot' derogatory corpus initially employed to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. Our findings also connect the long-studied adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend toward multimodality in frontier foundation models.

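Motivation and Objectives

Highlight that visual input expands the attack surface of vision-enabled LLMs;
Demonstrate that a single visual adversarial example can universally jailbreak aligned VLMs;
Show that the jailbreak transfers across multiple models under black-box conditions;
Connect the adversarial vulnerabilities of neural networks to the AI alignment challenges of multimodal models.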

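Proposed Method

Construct the adversarial input x_adv by minimizing the negative log-likelihood of a small few-shot harmful corpus Y conditioned on x_adv (Eq. (1)).
Because the visual perturbation is end-to-end differentiable, optimize x_adv with PGD under either a constrained (epsilon-bounded) or an unconstrained setting.
Combine x_adv with a harmful instruction x_harm into the joint input [x_adv, x_harm] to trigger jailbroken output.
Compare the visual attack against a text attack that uses discrete optimization (HotFlip / Shin et al.) over adversarial text tokens of matched length.
Evaluate the attack on vision-enabled models built on Vicuna (MiniGPT-4, InstructBLIP) and on LLaVA built on LLaMA-2-Chat, including a transferability analysis.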
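
The optimization loop described in the method steps can be illustrated with a minimal sketch. It assumes a toy linear softmax model in place of a real VLM, placeholder integer token ids in place of the corpus Y, a signed-gradient update, and an L-infinity constraint; none of these details are confirmed by this summary. The names `nll_and_grad`, `pgd`, `W`, `alpha`, and `steps` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_and_grad(x, W, targets):
    """Toy stand-in for the model (assumption): logits = W @ x.

    Returns the mean negative log-likelihood of the target token ids
    and its gradient with respect to the input pixels x.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    probs = softmax(logits)
    loss = -sum(math.log(probs[t]) for t in targets) / len(targets)
    # d(loss)/d(logits) = probs - empirical distribution of the targets
    freq = [0.0] * len(W)
    for t in targets:
        freq[t] += 1.0 / len(targets)
    dlogits = [p - f for p, f in zip(probs, freq)]
    # Chain rule through the linear layer: grad_x = W^T @ dlogits
    grad = [sum(W[k][j] * dlogits[k] for k in range(len(W)))
            for j in range(len(x))]
    return loss, grad


def pgd(x_benign, W, targets, eps=None, alpha=1 / 255, steps=200):
    """Signed-gradient PGD that lowers the NLL of `targets`.

    eps=None corresponds to the unconstrained setting; otherwise the
    perturbation is projected back into an L-infinity ball of radius eps.
    """
    x = list(x_benign)
    for _ in range(steps):
        _, grad = nll_and_grad(x, W, targets)
        # Descend on the loss using only the sign of each gradient entry
        x = [xi - alpha * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]
        if eps is not None:
            x = [min(max(xi, b - eps), b + eps) for xi, b in zip(x, x_benign)]
        # Keep pixels in the valid [0, 1] range
        x = [min(max(xi, 0.0), 1.0) for xi in x]
    return x


random.seed(0)
n_pixels, vocab = 16, 5  # toy sizes, chosen only for illustration
W = [[random.uniform(-1, 1) for _ in range(n_pixels)] for _ in range(vocab)]
x_benign = [random.uniform(0.2, 0.8) for _ in range(n_pixels)]
targets = [1, 3, 3]  # placeholder token ids standing in for the corpus Y

eps = 16 / 255
x_adv = pgd(x_benign, W, targets, eps=eps)
loss_before, _ = nll_and_grad(x_benign, W, targets)
loss_after, _ = nll_and_grad(x_adv, W, targets)
print(f"NLL before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Passing `eps=None` to `pgd` mimics the unconstrained setting, where only the valid pixel range limits the perturbation. In a real VLM the gradient would come from backpropagation through the vision encoder and language model rather than from the closed-form expression used here.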

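Experimental Results

Research Questions

RQ1: Can visual adversarial examples universally jailbreak the alignment guardrails of vision-enabled LLMs?
RQ2: How do visual attacks compare with text-only adversarial attacks in their effectiveness at jailbreaking and eliciting harmful output?
RQ3: Do visual adversarial jailbreaks transfer across different VLMs (black-box setting)?
RQ4: Does the range of harmful output induced by these visual adversarial examples extend beyond the few-shot corpus used for optimization?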

Key Findings

Setting	Identity Attack	Disinformation	Violence/Crime	X-risk
benign image (no attack)	26.2	48.9	50.1	20.0
adv. image (eps = 16/255)	61.5	58.9	80.0	50.0
adv. image (eps = 32/255)	70.0	74.4	87.3	73.3
adv. image (eps = 64/255)	77.7	84.4	81.3	53.3
adv. image (unconstrained)	78.5	91.1	84.0	63.3
adv. text (unconstrained)	58.5	68.9	24.0	26.7

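A single visual adversarial example markedly increases the likelihood that an aligned VLM produces harmful content across several categories (identity attack, disinformation, violence/crime, X-risk);
With epsilon up to 64/255, and also with unconstrained visual input, the attack achieves high jailbreak success rates in human evaluation across all four categories;
Visual adversarial examples also raise toxicity on RealToxicityPrompts: the share of outputs exhibiting toxic attributes, as measured by the Perspective API and Detoxify, increases;
Compared with a text adversarial attack of equal length, the visual attack generally yields a stronger jailbreak effect and drives the optimization loss down more effectively;
The attack exhibits black-box transferability among MiniGPT-4 (Vicuna), InstructBLIP (Vicuna), and LLaVA (LLaMA-2-Chat);
DiffPure-based purification mitigates part of the toxicity increase caused by visual adversarial inputs.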
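
To make the comparison in the table above easier to read, the following sketch computes each setting's difference from the benign-image baseline, per category. The numbers are copied from the table; the dictionary keys and the helper `gain_over_benign` are hypothetical names introduced only for this illustration, and the values are assumed to be percentages.

```python
CATEGORIES = ["Identity Attack", "Disinformation", "Violence/Crime", "X-risk"]

# Values copied from the table; the short keys are hypothetical labels.
RATES = {
    "benign": [26.2, 48.9, 50.1, 20.0],
    "image_eps16": [61.5, 58.9, 80.0, 50.0],
    "image_eps32": [70.0, 74.4, 87.3, 73.3],
    "image_eps64": [77.7, 84.4, 81.3, 53.3],
    "image_unconstrained": [78.5, 91.1, 84.0, 63.3],
    "text_unconstrained": [58.5, 68.9, 24.0, 26.7],
}


def gain_over_benign(rates, baseline="benign"):
    """Return per-category differences (in points) from the baseline row."""
    base = rates[baseline]
    return {
        name: [round(v - b, 1) for v, b in zip(vals, base)]
        for name, vals in rates.items()
        if name != baseline
    }


gains = gain_over_benign(RATES)
for name, deltas in gains.items():
    print(name, dict(zip(CATEGORIES, deltas)))
```

With these numbers, every adversarial-image row sits above the benign baseline in all four categories, while the adversarial-text row falls below the baseline for Violence/Crime (24.0 vs. 50.1).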

This review was generated by AI and reviewed by a human editor.