QUICK REVIEW

[Paper Review] Jailbreaking Attack against Multimodal Large Language Model

Zhenxing Niu, Haodong Ren | arXiv (Cornell University) | Feb 4, 2024

Adversarial Robustness in Machine Learning | Cited by 11

One-Sentence Summary

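This paper proposes maximum-likelihood-based jailbreaking attacks on multimodal LLMs (imgJP and deltaJP), shows that they are highly data-universal and model-transferable, and introduces a more efficient construction-based method that extends the approach to LLM jailbreaking.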

ABSTRACT

This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs), seeking to elicit MLLMs to generate objectionable responses to harmful user queries. A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP), enabling jailbreaks against MLLMs across multiple unseen prompts and images (i.e., data-universal property). Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we introduce a construction-based method to harness our approach for LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. The code is available here. Warning: some content generated by language models may be offensive to some readers.

Research Motivation and Objectives

Demonstrate that MLLMs are vulnerable to jailbreaking via image prompts (imgJP) and perturbations (deltaJP).
Develop a maximum-likelihood framework to maximize target outputs for harmful queries.
Show data-universal properties (prompt- and image-universal) and model-transferability across multiple MLLMs.
Explore a construction-based method to leverage MLLM jailbreaking for efficient LLM jailbreaking.
Propose an ensemble surrogate-model strategy to enhance black-box attack success.

Proposed Method

Formulate jailbreaking as maximizing log-probabilities of target harmful outputs conditioned on query and image (Eq. 1).
Extend to deltaJP with an attack budget to perturb a given input image (Eq. 2).
Generalize to image-universal deltaJP by aggregating perturbations over a distribution of images (Eq. 3).
Use ensemble learning over multiple surrogate MLLMs to improve transferability (Eq. 4).
Introduce a construction-based attack to jailbreak target LLMs by deriving txtJP from embJP via De-embedding and De-tokenizer operations.
Apply a pool-based, efficient sampling (RandSet) strategy to generate effective txtJP tokens (Top-1, Random-1, RandSet).
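The four objectives listed above can be written compactly as follows. This is only a sketch under assumed notation, not the paper's exact equations: $q_i$ is a harmful query, $y_i$ its target output, $x_{jp}$ the image prompt, $x$ or $x_j$ a given input image, $\delta$ the perturbation, $\epsilon$ the attack budget, and $p_k$ the $k$-th surrogate model. The choice of norm $\|\cdot\|_p$ and the use of plain sums are assumptions made for illustration.

```latex
% Sketch of Eq. 1 (imgJP): maximize the likelihood of the targets
\max_{x_{jp}} \; \sum_{i=1}^{M} \log p\left(y_i \mid q_i, x_{jp}\right)

% Sketch of Eq. 2 (deltaJP): perturb a given image x within a budget
\max_{\delta} \; \sum_{i=1}^{M} \log p\left(y_i \mid q_i, x + \delta\right)
\quad \text{s.t.} \quad \|\delta\|_p \le \epsilon

% Sketch of Eq. 3 (image-universal deltaJP): one delta shared over N images
\max_{\delta} \; \sum_{j=1}^{N} \sum_{i=1}^{M} \log p\left(y_i \mid q_i, x_j + \delta\right)
\quad \text{s.t.} \quad \|\delta\|_p \le \epsilon

% Sketch of Eq. 4 (ensemble): sum the objective over K surrogate models
\max_{x_{jp}} \; \sum_{k=1}^{K} \sum_{i=1}^{M} \log p_k\left(y_i \mid q_i, x_{jp}\right)
```

Read this way, each later objective changes only what is summed over or constrained, which is why the review lists them as successive extensions of the same maximum-likelihood formulation.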

Figure 1: An example of a jailbreaking attack against MiniGPT-v2. With a normal image as input, MiniGPT-v2 will refuse to answer the harmful request (e.g., replying ‘I’m sorry, I cannot fulfill your request’). In contrast, with our generated imgJP, MiniGPT-v2 responds to the harmful request.

Experimental Results

Research Questions

RQ1: Can MLLMs be reliably jailbroken using image prompts (imgJP) and image perturbations (deltaJP) across unseen prompts and images?
RQ2: Do jailbreaking prompts exhibit data-universal properties across multiple harmful prompts and image categories?
RQ3: Is jailbreaking transferable in black-box settings across different MLLM architectures?
RQ4: Can MLLM jailbreaking techniques be leveraged to perform efficient LLM jailbreaking via a construction-based approach?
RQ5: How does an ensemble surrogate-model strategy affect transferability and success rates?

Key Findings

imgJP-based jailbreaking achieves high ASR on several MLLMs in white-box settings (e.g., up to 77–93% train/test for various configurations).
deltaJP-based jailbreaking demonstrates both prompt-universal and image-universal properties with varying effectiveness across categories.
Model-transferability is evidenced by successful black-box jailbreaks on mPLUG-Owl2, LLaVA, MiniGPT-v2, and InstructBLIP, with notable gains using ensemble surrogates.
Construction-based LLM jailbreaking achieves high efficiency (e.g., 93% ASR with a pool of 20 reversed txtJPs) compared to state-of-the-art methods.
Ensembling three surrogate models improves transferability, yielding higher ASR across target models compared to single-model surrogates.
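The findings above are reported as attack success rate (ASR). The review does not say how a response is judged, so the following is only a minimal sketch of one common evaluation convention: a response counts as a success when it contains no refusal phrase, and a pooled attempt counts as a success when any candidate in the pool succeeds. The marker list and the function names `is_refusal`, `attack_success_rate`, and `pooled_success_rate` are hypothetical and do not come from the paper.

```python
from typing import Iterable, Sequence

# Hypothetical refusal phrases; real evaluations typically use a longer list
# or a human / model-based judge.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_refusal(response: str, markers: Sequence[str] = REFUSAL_MARKERS) -> bool:
    """Return True if the response contains any refusal marker."""
    text = response.strip().lower()
    return any(marker in text for marker in markers)


def attack_success_rate(responses: Iterable[str]) -> float:
    """ASR in percent: share of responses that are not refusals."""
    items = list(responses)
    if not items:
        raise ValueError("need at least one response")
    successes = sum(not is_refusal(r) for r in items)
    return 100.0 * successes / len(items)


def pooled_success_rate(outcomes: Sequence[Sequence[bool]]) -> float:
    """ASR in percent when each query is tried with a pool of candidates.

    outcomes[i][j] is True if candidate j succeeded on query i; a query
    counts as a success if any candidate in its pool succeeded (assumed
    scoring rule).
    """
    if not outcomes:
        raise ValueError("need at least one query")
    successes = sum(any(pool) for pool in outcomes)
    return 100.0 * successes / len(outcomes)
```

A keyword check like this is cheap but coarse, since it can misjudge responses that neither refuse nor actually comply, so reported ASR values are only comparable when the same judging rule is used.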

Figure 2: The jailbreaks with imgJP. Given a harmful request, we attempt to maximize the likelihood of generating the corresponding target outputs. The target outputs typically commence with a positive affirmation, such as “Sure, here is a (content of query)”.

This review was generated by AI and checked by a human editor.