QUICK REVIEW

[Paper Review] Jailbreaking Attack against Multimodal Large Language Model

Zhenxing Niu, Haodong Ren|arXiv (Cornell University)|Feb 4, 2024
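Adversarial Robustness in Machine Learning | Cited by 11
One-Sentence Summary

This paper proposes maximum-likelihood-based jailbreaking attacks (imgJP and deltaJP) against multimodal large language models, shows that they have strong data-universal properties and model transferability, and introduces a construction-based method that extends the approach to LLM jailbreaks with improved efficiency.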

ABSTRACT

This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs), seeking to elicit MLLMs to generate objectionable responses to harmful user queries. A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP), enabling jailbreaks against MLLMs across multiple unseen prompts and images (i.e., data-universal property). Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we introduce a construction-based method to harness our approach for LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. The code is available here. Warning: some content generated by language models may be offensive to some readers.

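Research Motivation and Objectives

  • Demonstrate that MLLMs are vulnerable to jailbreaking attacks delivered through image prompts (imgJP) and image perturbations (deltaJP).
  • Build a maximum-likelihood framework that maximizes the likelihood of the target outputs for harmful queries.
  • Demonstrate the data-universal property (prompt-universal and image-universal) and model transferability across multiple MLLMs.
  • Explore a construction-based method that leverages MLLM jailbreaks to achieve efficient LLM jailbreaks.
  • Propose an ensemble surrogate-model strategy to raise the success rate of black-box attacks.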

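Proposed Method

  • Formulate jailbreaking as maximizing the log-probability of the target harmful output conditioned on the query and the image (Eq. 1).
  • Extend to deltaJP by perturbing a given input image under an attack budget (Eq. 2).
  • Generalize deltaJP to be image-universal by aggregating the perturbation over a set of images (Eq. 3).
  • Improve transferability through ensemble learning over multiple surrogate MLLMs (Eq. 4).
  • Introduce a construction-based attack that jailbreaks a target LLM by deriving txtJP from embJP through De-embedding and De-tokenizer operations.
  • Adopt an efficient pool-based sampling strategy (RandSet) to generate effective txtJP tokens (strategies: Top-1, Random-1, RandSet).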
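The objectives named in the list above can be sketched as follows. The notation is assumed for illustration and is not copied from the paper: q_i is a harmful query, y_i its target output, x_jp the image jailbreaking prompt, x a given input image, \delta the perturbation, \epsilon the attack budget, p the norm used for the budget (the summary does not specify it), and p_{\theta_k} the output distribution of the k-th of K surrogate MLLMs.

```latex
% Sketch of Eq. 1 (imgJP), assumed notation: sum over M query/target pairs
\max_{x_{jp}} \; \sum_{i=1}^{M} \log p_\theta\!\left(y_i \mid q_i, x_{jp}\right)

% Sketch of Eq. 2 (deltaJP): perturb a given image x within budget \epsilon
\max_{\delta} \; \sum_{i=1}^{M} \log p_\theta\!\left(y_i \mid q_i, x + \delta\right)
\quad \text{s.t.} \quad \|\delta\|_p \le \epsilon

% Sketch of the ensemble idea (Eq. 4): sum the objective over K surrogates
\max_{x_{jp}} \; \sum_{k=1}^{K} \sum_{i=1}^{M} \log p_{\theta_k}\!\left(y_i \mid q_i, x_{jp}\right)
```

The exact weighting and constraints used in the paper may differ from this sketch.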
Figure 1: An example of a jailbreaking attack against MiniGPT-v2. With a normal image as input, MiniGPT-v2 will refuse to answer the harmful request (e.g., replying ‘I’m sorry, I cannot fulfill your request’). In contrast, with our generated imgJP, MiniGPT-v2 responds to the harmful request.

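Experimental Results

Research Questions

  • RQ1: Can image prompts (imgJP) and image perturbations (deltaJP) reliably jailbreak MLLMs on unseen prompts and images?
  • RQ2: Do the jailbreaking prompts exhibit the data-universal property across multiple categories of harmful prompts and images?
  • RQ3: Are jailbreaks transferable across different MLLM architectures in the black-box setting?
  • RQ4: Can MLLM jailbreaking techniques be used, via the construction-based method, to achieve efficient LLM jailbreaks?
  • RQ5: How does the ensemble surrogate-model strategy affect transferability and success rate?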

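Key Findings

  • imgJP-based jailbreaks achieve a high attack success rate (ASR) in the white-box setting on several MLLMs (for example, 77–93% on train/test across configurations).
  • deltaJP-based jailbreaks are both prompt-universal and image-universal, with effectiveness that varies by category.
  • Black-box jailbreaks succeed against mPLUG-Owl2, LLaVA, MiniGPT-v2, and InstructBLIP, with a notable gain from ensemble surrogates, which demonstrates model transferability.
  • The construction-based LLM jailbreak is more efficient than existing methods (for example, a pool of 20 reversed txtJPs reaches 93% ASR).
  • Ensembling three surrogate models improves transferability, yielding a higher ASR on the target model than a single surrogate.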
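The ASR figures above are percentages of evaluated queries judged to be successful jailbreaks. A minimal sketch of that arithmetic follows. The function name and the per-query judgments are hypothetical, and how the paper decides whether a response counts as a jailbreak is not described in this summary.

```python
from typing import Sequence


def attack_success_rate(jailbroken: Sequence[bool]) -> float:
    """Return the share of judged queries marked as jailbroken, in percent."""
    if not jailbroken:
        raise ValueError("need at least one judged query")
    return 100.0 * sum(jailbroken) / len(jailbroken)


# Hypothetical judgments: True means the response was judged a successful jailbreak.
judgments = [True] * 93 + [False] * 7
asr = attack_success_rate(judgments)
```

With these made-up judgments, 93 of 100 queries are marked successful, which is how a figure such as "93% ASR" would arise.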
Figure 2: The jailbreaks with imgJP. Given a harmful request, we attempt to maximize the likelihood of generating the corresponding target outputs. The target outputs typically commence with a positive affirmation, such as “Sure, here is a (content of query)”.


This review was generated by AI and reviewed by human editors.