[论文解读] Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
本文将最小-最大化防御目标形式化,并引入鲁棒提示优化(RPO),一种基于梯度的令牌后缀方法,能够对抗跨模型的通用、可迁移后缀,达到最先进的鲁棒性。
Despite advances in AI alignment, large language models (LLMs) remain vulnerable to adversarial attacks or jailbreaking, in which adversaries can modify prompts to induce unwanted behavior. While some defenses have been proposed, they have not been adapted to newly proposed attacks and more challenging threat models. To address this, we propose an optimization-based objective for defending LLMs against jailbreaking attacks and an algorithm, Robust Prompt Optimization (RPO) to create robust system-level defenses. Our approach directly incorporates the adversary into the defensive objective and optimizes a lightweight and transferable suffix, enabling RPO to adapt to worst-case adaptive attacks. Our theoretical and experimental results show improved robustness to both jailbreaks seen during optimization and unknown jailbreaks, reducing the attack success rate (ASR) on GPT-4 to 6% and Llama-2 to 0% on JailbreakBench, setting the state-of-the-art. Code can be found at https://github.com/lapisrocks/rpo
研究动机与目标
- Formalize a realistic adversarial threat model for LM jailbreaking.
- Propose a minimax defense objective specific to prompt-level defenses.
- Introduce Robust Prompt Optimization (RPO) to optimize defensive suffix tokens.
- Demonstrate universal, transferable robustness with minimal impact on benign use.
提出的方法
- Formulate a worst-case adversarial objective for jailbreaking with gradient-access and black-box prompts.
- Develop RPO, which alternates between a jailbreak selection step and a discrete token-suffix optimization step.
- Use a greedy coordinate descent with first-order gradients to identify top-k defensive tokens.
- Apply a suffix optimization that minimizes the safe loss under worst-case adversarial prompts.
- Demonstrate transferability of RPOsuffix to black-box models and other LMs.
- Evaluate against multiple known and unknown jailbreaks, including adaptive attacks.

实验结果
研究问题
- RQ1Can a defensively optimized suffix generalize to unseen jailbreaks and adaptive attacks?
- RQ2Does RPO transfer across models, including black-box settings like GPT-4?
- RQ3What are the practical costs (inference impact) of applying RPO suffixes?
- RQ4How does RPO perform relative to prior defenses across manual and gradient-based jailbreaks?
主要发现
| 方法 | 基线 | GCG | 对抗指令 | 单一角色扮演 | 多角色扮演 |
|---|---|---|---|---|---|
| 基线 | 6.0 | 86.0 | 98.0 | 84.0 | 96.0 |
| 困惑度过滤器 | 6.0 | 0.0 | 98.0 | 84.0 | 96.0 |
| 自我提醒 | 0.0 | 12.0 | 98.0 | 82.0 | 94.0 |
| 目标优先级 | 0.0 | 0.0 | 94.0 | 80.0 | 90.0 |
| RPO(我们) | 0.0 | 4.0 | 20.0 | 0.0 | 0.0 |
| + 端内学习 | 0.0 | 0.0 | 16.0 | 0.0 | 0.0 |
- RPO reduces Starling-7B attack success from 84% to 8.66% on 20 jailbreaks (unknown/offline tests).
- RPO suffix transfers to GPT-4, lowering GUARD attack success from 92% to 6%.
- RPO suffix incurs negligible inference cost and has only minor impact on benign prompts.
- RPO outperforms strong baselines (perplexity filter, goal prioritization) on unseen jailbreaks and adaptive attacks.
- RPO demonstrates transferability to Llama-2 and Vicuna family models, with notable gains on open-source LMs.

更好的研究,从现在开始
从论文设计到论文写作,大幅缩短您的研究时间。
无需绑定信用卡
本解读由 AI 生成,并经人工编辑审核。