QUICK REVIEW

[Paper Review] Jailbreaking Attack against Multimodal Large Language Model

Zhenxing Niu, Haodong Ren|arXiv (Cornell University)|Feb 4, 2024

Adversarial Robustness in Machine Learning | Cited by 11

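One-Sentence Summary

This paper proposes maximum likelihood-based jailbreaking attacks (imgJP and deltaJP) against multimodal large language models, shows that they are strongly data-universal and model-transferable, and introduces a construction-based method that extends the approach to LLM jailbreaks with improved efficiency.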

ABSTRACT

This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs), seeking to elicit MLLMs to generate objectionable responses to harmful user queries. A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP), enabling jailbreaks against MLLMs across multiple unseen prompts and images (i.e., data-universal property). Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we introduce a construction-based method to harness our approach for LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. The code is available here. Warning: some content generated by language models may be offensive to some readers.

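Motivation and Objectives

Demonstrate that MLLMs are vulnerable to jailbreaking attacks via image prompts (imgJP) and image perturbations (deltaJP).
Build a maximum-likelihood framework that maximizes the likelihood of the target outputs for harmful queries.
Show data-universality (both prompt-universal and image-universal) and model-transferability across multiple MLLMs.
Explore a construction-based method that leverages MLLM jailbreaks to achieve efficient LLM jailbreaks.
Propose an ensemble surrogate-model strategy to raise the success rate of black-box attacks.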

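Proposed Method

Formulate jailbreaking as maximizing the log-probability of the target harmful output conditioned on the query and the image (Eq. 1).
Extend the formulation to deltaJP by perturbing a given input image under an attack budget (Eq. 2).
Generalize deltaJP to an image-universal deltaJP by aggregating the perturbation objective over a set of images (Eq. 3).
Improve transferability through ensemble learning over multiple surrogate MLLMs (Eq. 4).
Introduce a construction-based attack to jailbreak a target LLM: derive txtJP from embJP via De-embedding and De-tokenizer operations.
Use an efficient pool-based sampling strategy (RandSet) to generate effective txtJP tokens; the strategies compared are Top-1, Random-1, and RandSet.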
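The budget-constrained objective (Eq. 2) can be illustrated with a toy example. The sketch below is not the authors' implementation: it swaps the MLLM for a small linear softmax classifier so the gradient can be written by hand, and it assumes projected sign-gradient ascent (PGD-style) as the optimizer, which this summary does not specify. All names (`pgd_maximize_likelihood`, `eps`, `alpha`) and the budget values are hypothetical.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()  # shift for numerical stability
    return z - np.log(np.exp(z).sum())

def target_log_likelihood(W, b, x, target):
    """log p(target | x) for a toy linear model standing in for an MLLM."""
    return log_softmax(W @ x + b)[target]

def grad_wrt_input(W, b, x, target):
    """Analytic gradient of log p(target | x) with respect to the input x."""
    p = np.exp(log_softmax(W @ x + b))
    onehot = np.zeros_like(p)
    onehot[target] = 1.0
    return W.T @ (onehot - p)

def pgd_maximize_likelihood(W, b, x0, target, eps=8 / 255, alpha=1 / 255, steps=50):
    """Search for delta with ||delta||_inf <= eps that raises log p(target | x0 + delta)."""
    delta = np.zeros_like(x0)
    best_delta, best_ll = delta.copy(), target_log_likelihood(W, b, x0, target)
    for _ in range(steps):
        g = grad_wrt_input(W, b, x0 + delta, target)
        # Ascent step, then projection onto the L-infinity budget.
        delta = np.clip(delta + alpha * np.sign(g), -eps, eps)
        # Keep the perturbed input inside the valid pixel range [0, 1].
        delta = np.clip(x0 + delta, 0.0, 1.0) - x0
        ll = target_log_likelihood(W, b, x0 + delta, target)
        if ll > best_ll:  # remember the best iterate seen so far
            best_delta, best_ll = delta.copy(), ll
    return best_delta

# Toy setup (assumption): 5 output "tokens", a 64-pixel "image".
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 64))
b = np.zeros(5)
x0 = rng.uniform(0.2, 0.8, size=64)
target = 3
eps = 8 / 255

delta = pgd_maximize_likelihood(W, b, x0, target, eps=eps)
before = target_log_likelihood(W, b, x0, target)
after = target_log_likelihood(W, b, x0 + delta, target)
print(f"log-likelihood before: {before:.3f}, after: {after:.3f}")
```

In a real setting the analytic gradient would be replaced by backpropagation through the model, and the image-universal and ensemble variants (Eq. 3 and Eq. 4) would presumably sum the same objective over several images or several surrogate models before taking the step.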

Figure 1: An example of a jailbreaking attack against MiniGPT-v2. With a normal image as input, MiniGPT-v2 will refuse to answer the harmful request (e.g., replying ‘I’m sorry, I cannot fulfill your request’). In contrast, with our generated imgJP, MiniGPT-v2 responds to the harmful request.
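The De-embedding and De-tokenizer steps listed under the proposed method can be illustrated with a nearest-neighbour lookup. This is a minimal sketch under stated assumptions, not the paper's procedure. It assumes that de-embedding means finding the vocabulary embeddings most cosine-similar to each embJP vector. It also reads Top-1, Random-1, and RandSet as "take the closest token", "take one random token from the top-k", and "draw a pool of random txtJP strings". The vocabulary, embeddings, and function names are all hypothetical.

```python
import numpy as np

def de_embed(emb_jp, vocab_emb, k=2):
    """For each row of emb_jp, return indices of the k most cosine-similar vocab rows."""
    a = emb_jp / np.linalg.norm(emb_jp, axis=1, keepdims=True)
    v = vocab_emb / np.linalg.norm(vocab_emb, axis=1, keepdims=True)
    sims = a @ v.T
    return np.argsort(-sims, axis=1)[:, :k]

def de_tokenize(token_ids, vocab):
    """Stand-in for a tokenizer's decode step: join the token strings."""
    return " ".join(vocab[i] for i in token_ids)

def top1(candidates, vocab):
    # Assumed meaning of Top-1: always pick the closest token.
    return de_tokenize(candidates[:, 0], vocab)

def random1(candidates, vocab, rng):
    # Assumed meaning of Random-1: pick one random token among the top-k per position.
    return de_tokenize([rng.choice(row) for row in candidates], vocab)

def rand_set(candidates, vocab, pool_size, rng):
    # Assumed meaning of RandSet: a pool of independently sampled txtJP strings.
    return [random1(candidates, vocab, rng) for _ in range(pool_size)]

# Toy vocabulary with one-hot embeddings (assumption, for illustration only).
vocab = ["a", "b", "c", "d"]
vocab_emb = np.eye(4)
emb_jp = np.array([[0.9, 0.1, 0.0, 0.0],
                   [0.0, 0.2, 0.1, 0.95]])

candidates = de_embed(emb_jp, vocab_emb, k=2)
rng = np.random.default_rng(0)
txt_top1 = top1(candidates, vocab)
pool = rand_set(candidates, vocab, pool_size=5, rng=rng)
print(txt_top1, pool)
```

A pool gives the attacker several candidate strings to try against the target LLM instead of one, which matches the pool-based description in the method list.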

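Experimental Results

Research Questions

RQ1: Can image prompts (imgJP) and image perturbations (deltaJP) reliably jailbreak MLLMs on unseen prompts and images?
RQ2: Do the jailbreaking prompts exhibit data-universality across multiple categories of harmful prompts and images?
RQ3: Do the jailbreaks transfer across different MLLM architectures in the black-box setting?
RQ4: Can MLLM jailbreaking techniques be used to achieve efficient LLM jailbreaks via the construction-based method?
RQ5: How does the ensemble surrogate-model strategy affect transferability and success rate?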

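Key Findings

imgJP-based jailbreaks achieve a high attack success rate (ASR) in the white-box setting on several MLLMs (e.g., 77–93% across the different train/test configurations).
deltaJP-based jailbreaks work in both the prompt-universal and image-universal settings, with effectiveness varying across categories.
Black-box jailbreaks succeed against mPLUG-Owl2, LLaVA, MiniGPT-v2, and InstructBLIP, with notable gains from the surrogate ensemble, demonstrating model-transferability.
The construction-based LLM jailbreak is more efficient than existing methods (e.g., a pool of 20 reversed txtJP reaches 93% ASR).
Ensembling three surrogate models improves transferability, yielding a higher ASR on the target model than a single-model surrogate.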
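The findings above are reported as ASR. This summary does not state how the paper decides whether a response counts as a successful jailbreak. The sketch below therefore assumes a simple refusal-keyword check, a common convention in jailbreak evaluations, and computes ASR as the percentage of responses without a refusal marker. The marker list, function names, and placeholder responses are hypothetical.

```python
# Hypothetical refusal markers; a real evaluation would use a longer, vetted list
# or a judge model.
REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't", "As an AI")

def is_jailbroken(response, markers=REFUSAL_MARKERS):
    """Treat a response as a successful attack if it contains no refusal marker."""
    text = response.lower()
    return not any(marker.lower() in text for marker in markers)

def attack_success_rate(responses):
    """ASR in percent: share of responses that are not refusals."""
    if not responses:
        raise ValueError("responses must not be empty")
    successes = sum(is_jailbroken(r) for r in responses)
    return 100.0 * successes / len(responses)

# Placeholder model outputs (no real model was queried).
responses = [
    "I'm sorry, I cannot fulfill your request.",
    "Sure, here is a (content of query)",
    "Sure, here is a (content of query)",
    "I cannot help with that.",
]
asr = attack_success_rate(responses)
print(f"ASR: {asr:.1f}%")
```

Keyword matching is cheap but can misjudge responses that comply after an apology or refuse without a listed phrase, so reported ASR numbers depend on the evaluation criterion used.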

Figure 2: The jailbreaks with imgJP. Given a harmful request, we attempt to maximize the likelihood of generating the corresponding target outputs. The target outputs typically commence with a positive affirmation, such as “Sure, here is a (content of query)”.

This review was generated by AI and reviewed by a human editor.