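[Paper Review] Jailbreaking Attack against Multimodal Large Language Model
This paper proposes maximum-likelihood-based jailbreaking attacks on multimodal LLMs (imgJP and deltaJP), shows that they are strongly data-universal and model-transferable, and introduces a more efficient construction-based method that extends the approach to LLM jailbreaking.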
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs), seeking to elicit MLLMs to generate objectionable responses to harmful user queries. A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP), enabling jailbreaks against MLLMs across multiple unseen prompts and images (i.e., data-universal property). Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we introduce a construction-based method to harness our approach for LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. The code is available here. Warning: some content generated by language models may be offensive to some readers.
Research Motivation and Objectives
- Demonstrate that MLLMs are vulnerable to jailbreaking via image prompts (imgJP) and perturbations (deltaJP).
- Develop a maximum-likelihood framework to maximize target outputs for harmful queries.
- Show data-universal properties (prompt- and image-universal) and model-transferability across multiple MLLMs.
- Explore a construction-based method to leverage MLLM jailbreaking for efficient LLM jailbreaking.
- Propose an ensemble surrogate-model strategy to enhance black-box attack success.
Proposed Method
- Formulate jailbreaking as maximizing log-probabilities of target harmful outputs conditioned on query and image (Eq. 1).
- Extend to deltaJP with an attack budget to perturb a given input image (Eq. 2).
- Generalize to image-universal deltaJP by aggregating perturbations over a distribution of images (Eq. 3).
- Use ensemble learning over multiple surrogate MLLMs to improve transferability (Eq. 4).
- Introduce a construction-based attack to jailbreak target LLMs by deriving txtJP from embJP via De-embedding and De-tokenizer operations.
- Apply a pool-based, efficient sampling (RandSet) strategy to generate effective txtJP tokens (Top-1, Random-1, RandSet).
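
The four objectives listed above (Eq. 1 to Eq. 4) can be written out roughly as follows. This is only a sketch reconstructed from the descriptions in this review: the symbols (q_i for a harmful query, y_i for its target output, x_jp for the image jailbreaking prompt, delta for the perturbation, epsilon for the attack budget, p_k for the k-th surrogate model) are assumed notation, and the paper's own formulation may differ in detail.

```latex
% Assumed notation, for illustration only; requires the amsmath package.
% q_i: harmful query, y_i: target output, x_{jp}: image jailbreaking prompt,
% x, x_j: input images, \delta: perturbation, \epsilon: attack budget,
% p_k: output distribution of the k-th surrogate model.
\begin{align}
  % Eq. 1: imgJP, maximize the log-likelihood of the targets over M queries
  & \max_{x_{jp}} \sum_{i=1}^{M} \log p\left(y_i \mid q_i, x_{jp}\right) \\
  % Eq. 2: deltaJP, perturb a given image x within a norm budget
  & \max_{\delta} \sum_{i=1}^{M} \log p\left(y_i \mid q_i, x + \delta\right)
    \quad \text{s.t. } \lVert \delta \rVert_p \le \epsilon \\
  % Eq. 3: image-universal deltaJP, one perturbation shared across N images
  & \max_{\delta} \sum_{j=1}^{N} \sum_{i=1}^{M} \log p\left(y_i \mid q_i, x_j + \delta\right)
    \quad \text{s.t. } \lVert \delta \rVert_p \le \epsilon \\
  % Eq. 4: ensemble over K surrogate models to improve transferability
  & \max_{x_{jp}} \sum_{k=1}^{K} \sum_{i=1}^{M} \log p_k\left(y_i \mid q_i, x_{jp}\right)
\end{align}
```

Read this way, each later objective adds one more sum (over images or over surrogate models) to the same log-likelihood term, which is what makes the resulting prompt universal or transferable.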

Experimental Results
Research Questions
- RQ1: Can MLLMs be reliably jailbroken using image prompts (imgJP) and image perturbations (deltaJP) across unseen prompts and images?
- RQ2: Do jailbreaking prompts exhibit data-universal properties across multiple harmful prompts and image categories?
- RQ3: Is jailbreaking transferable in black-box settings across different MLLM architectures?
- RQ4: Can MLLM jailbreaking techniques be leveraged to perform efficient LLM jailbreaking via a construction-based approach?
- RQ5: How does an ensemble surrogate-model strategy affect transferability and success rates?
Key Findings
主な発見
- imgJP-based jailbreaking achieves a high attack success rate (ASR) on several MLLMs in white-box settings (e.g., ASR reaching 77–93% on the train/test sets across configurations).
- deltaJP-based jailbreaking demonstrates both prompt-universal and image-universal properties with varying effectiveness across categories.
- Model-transferability is evidenced by successful black-box jailbreaks on mPLUG-Owl2, LLaVA, MiniGPT-v2, and InstructBLIP, with notable gains using ensemble surrogates.
- Construction-based LLM jailbreaking achieves high efficiency (e.g., 93% ASR with a pool of 20 reversed txtJPs) compared to state-of-the-art methods.
- Ensembling three surrogate models improves transferability, yielding higher ASR across target models compared to single-model surrogates.
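
The findings above are all reported as attack success rate (ASR). As a minimal sketch of how such a figure could be computed from already-judged outcomes, the snippet below assumes that each query has been labeled as jailbroken or not by some external judge, and that with a pool of candidates a query counts as a success if any candidate succeeds. Both the function names and that pooling rule are assumptions made for illustration, not the paper's evaluation code.

```python
from typing import List, Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of queries judged as successfully jailbroken."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def pooled_outcomes(per_query: Sequence[Sequence[bool]]) -> List[bool]:
    """Collapse a pool of candidates per query into one outcome per query.

    Assumption: a query counts as a success if at least one candidate
    in its pool succeeded.
    """
    return [any(candidates) for candidates in per_query]


if __name__ == "__main__":
    # Hypothetical labels: 4 queries, each tried with a pool of 3 candidates.
    labels = [
        [False, True, False],
        [False, False, False],
        [True, True, False],
        [False, False, True],
    ]
    print(attack_success_rate(pooled_outcomes(labels)))
```

Under this pooling rule, ASR can only stay the same or rise as the pool grows, so pooled figures are best compared against baselines evaluated with the same pool size.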

This review was generated by AI and checked by a human editor.