QUICK REVIEW

[Paper Review] Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks

Andy Zhou, Bo Li|arXiv (Cornell University)|Jan 30, 2024
Adversarial Robustness in Machine Learning|Cited by 5
One-Sentence Summary

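This paper formalizes a minimax defense objective and introduces Robust Prompt Optimization (RPO), a gradient-based token-suffix method that yields universal suffixes transferable across models and achieves state-of-the-art robustness.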

ABSTRACT

Despite advances in AI alignment, large language models (LLMs) remain vulnerable to adversarial attacks or jailbreaking, in which adversaries can modify prompts to induce unwanted behavior. While some defenses have been proposed, they have not been adapted to newly proposed attacks and more challenging threat models. To address this, we propose an optimization-based objective for defending LLMs against jailbreaking attacks and an algorithm, Robust Prompt Optimization (RPO) to create robust system-level defenses. Our approach directly incorporates the adversary into the defensive objective and optimizes a lightweight and transferable suffix, enabling RPO to adapt to worst-case adaptive attacks. Our theoretical and experimental results show improved robustness to both jailbreaks seen during optimization and unknown jailbreaks, reducing the attack success rate (ASR) on GPT-4 to 6% and Llama-2 to 0% on JailbreakBench, setting the state-of-the-art. Code can be found at https://github.com/lapisrocks/rpo

Motivation and Objectives

  • Formalize a realistic adversarial threat model for LM jailbreaking.
  • Propose a minimax defense objective specific to prompt-level defenses.
  • Introduce Robust Prompt Optimization (RPO) to optimize defensive suffix tokens.
  • Demonstrate universal, transferable robustness with minimal impact on benign use.
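
The minimax objective named above can be written down roughly as follows. This is only a sketch: the summary gives no formula, so every symbol here is an assumed notation (x is a harmful instruction, a ranges over a set of jailbreak transformations, s is the defensive suffix, and the safe loss is taken to be the negative log-likelihood of a refusal-style target).

```latex
% Assumed notation for illustration only; not copied from the paper.
% x: harmful instruction          a \in \mathcal{A}: jailbreak transformation
% s: defensive suffix             \oplus: token concatenation
% y_{\text{safe}}: refusal-style target response
\[
  s^{\star} \;=\; \arg\min_{s}\; \max_{a \in \mathcal{A}}\;
  \mathcal{L}^{\text{safe}}\bigl(a(x) \oplus s\bigr),
  \qquad
  \mathcal{L}^{\text{safe}}(z) \;=\; -\log p_{\theta}\bigl(y_{\text{safe}} \mid z\bigr).
\]
```

The inner maximization is the adversary choosing the most damaging jailbreak; the outer minimization is the defender choosing the suffix that keeps the safe loss low even in that case.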

Proposed Method

  • Formulate a worst-case adversarial objective for jailbreaking with gradient-access and black-box prompts.
  • Develop RPO, which alternates between a jailbreak selection step and a discrete token-suffix optimization step.
  • Use a greedy coordinate descent with first-order gradients to identify top-k defensive tokens.
  • Apply a suffix optimization that minimizes the safe loss under worst-case adversarial prompts.
  • Demonstrate transferability of the RPO suffix to black-box models and other LMs.
  • Evaluate against multiple known and unknown jailbreaks, including adaptive attacks.
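
The alternating loop described in the bullets above can be sketched on a toy problem. Everything in this block is a hypothetical stand-in rather than the authors' implementation: there is no language model, the "safe loss" is a softplus over random embeddings, and the names `make_toy_problem`, `safe_loss`, `worst_case`, `token_gradient`, and `rpo_sketch` are invented for illustration. One further assumption is that a candidate token is accepted only if it lowers the worst-case loss over all jailbreaks, which keeps the toy loss from ever increasing.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def make_toy_problem(vocab=50, dim=8, n_jailbreaks=5, seed=1):
    """Random token embeddings and jailbreaks (hypothetical stand-ins for an LM)."""
    rng = random.Random(seed)
    emb = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]
    jailbreaks = [{"direction": [rng.gauss(0, 1) for _ in range(dim)],
                   "strength": rng.uniform(1.0, 3.0)} for _ in range(n_jailbreaks)]
    return emb, jailbreaks


def margin(suffix, jb, emb):
    """Attack strength minus the suffix's defense score for one jailbreak."""
    return jb["strength"] - sum(dot(emb[t], jb["direction"]) for t in suffix)


def safe_loss(suffix, jb, emb):
    """Toy safe loss: softplus of the margin, written in a numerically stable form."""
    z = margin(suffix, jb, emb)
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def worst_case(suffix, jailbreaks, emb):
    """Return (loss, jailbreak) for the jailbreak with the highest safe loss."""
    return max(((safe_loss(suffix, jb, emb), jb) for jb in jailbreaks),
               key=lambda pair: pair[0])


def token_gradient(suffix, jb, emb):
    """Gradient of the toy loss w.r.t. a one-hot token indicator.

    In this linear toy the gradient is the same at every suffix position.
    """
    sig = 1.0 / (1.0 + math.exp(-margin(suffix, jb, emb)))
    return [-sig * dot(e, jb["direction"]) for e in emb]


def rpo_sketch(jailbreaks, emb, suffix_len=4, steps=30, top_k=8, seed=0):
    rng = random.Random(seed)
    suffix = [rng.randrange(len(emb)) for _ in range(suffix_len)]
    history = [worst_case(suffix, jailbreaks, emb)[0]]
    for _ in range(steps):
        # Step 1: jailbreak selection (the adversary's move).
        _, worst = worst_case(suffix, jailbreaks, emb)
        # Step 2: first-order gradients pick the top-k tokens expected to lower the loss.
        grad = token_gradient(suffix, worst, emb)
        candidates = sorted(range(len(emb)), key=grad.__getitem__)[:top_k]
        # Greedy coordinate step: try each candidate at one random position.
        pos = rng.randrange(suffix_len)
        best = history[-1]
        for tok in candidates:
            trial = list(suffix)
            trial[pos] = tok
            loss, _ = worst_case(trial, jailbreaks, emb)
            if loss < best:  # assumption: accept only worst-case improvements
                best, suffix = loss, trial
        history.append(best)
    return suffix, history


if __name__ == "__main__":
    emb, jailbreaks = make_toy_problem()
    suffix, history = rpo_sketch(jailbreaks, emb)
    print("suffix token ids:", suffix)
    print(f"worst-case toy loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

With a real model, the gradient would come from backpropagating the safe loss through the token embeddings, and each candidate evaluation would cost a forward pass, so the top-k filter is what keeps the search affordable.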
Figure 1: RPO optimizes a set of trigger tokens that enforces safe outputs even under jailbreaks and adversarial attacks. RPO suffixes are universal and transfer to many LMs and jailbreaks.

Experimental Results

Research Questions

  • RQ1: Can a defensively optimized suffix generalize to unseen jailbreaks and adaptive attacks?
  • RQ2: Does RPO transfer across models, including black-box settings like GPT-4?
  • RQ3: What are the practical costs (inference impact) of applying RPO suffixes?
  • RQ4: How does RPO perform relative to prior defenses across manual and gradient-based jailbreaks?

Key Findings

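Method                | Base | GCG  | Adversarial instructions | Single role-play | Multi role-play
Base                  | 6.0  | 86.0 | 98.0                     | 84.0             | 96.0
Perplexity filter     | 6.0  | 0.0  | 98.0                     | 84.0             | 96.0
Self-reminder         | 0.0  | 12.0 | 98.0                     | 82.0             | 94.0
Goal prioritization   | 0.0  | 0.0  | 94.0                     | 80.0             | 90.0
RPO (ours)            | 0.0  | 4.0  | 20.0                     | 0.0              | 0.0
+ In-context learning | 0.0  | 0.0  | 16.0                     | 0.0              | 0.0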
  • RPO reduces Starling-7B attack success from 84% to 8.66% on 20 jailbreaks (unknown/offline tests).
  • RPO suffix transfers to GPT-4, lowering GUARD attack success from 92% to 6%.
  • RPO suffix incurs negligible inference cost and has only minor impact on benign prompts.
  • RPO outperforms strong baselines (perplexity filter, goal prioritization) on unseen jailbreaks and adaptive attacks.
  • RPO demonstrates transferability to Llama-2 and Vicuna family models, with notable gains on open-source LMs.
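
To put the two headline figures above on a common footing, the drop in attack success rate can be expressed both in absolute percentage points and relative to the undefended rate. The snippet below is a small illustrative calculation using only the before/after values quoted in the findings; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(before, after):
    """Relative drop in attack success rate, as a percentage of the undefended rate."""
    if before <= 0:
        raise ValueError("undefended attack success rate must be positive")
    return 100.0 * (before - after) / before


# (undefended ASR %, ASR % with RPO), as quoted in the findings above.
reported = {
    "Starling-7B, 20 jailbreaks": (84.0, 8.66),
    "GPT-4, GUARD attack": (92.0, 6.0),
}

for name, (before, after) in reported.items():
    print(f"{name}: {before - after:.2f} points absolute, "
          f"{relative_reduction(before, after):.1f}% relative")
```

Both settings correspond to a relative reduction of roughly 90% or more, which is why the absolute numbers alone understate how similar the two results are.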
Figure 2: Overall performance on RPO and SOTA universal defense (Zhang et al., 2023) on a variety of strong, unseen jailbreaks. Base model is Starling-7B (Zhu et al., 2023a). We evaluate a single RPO suffix on the top 10 strongest jailbreaks from jailbreakchat.com and (Wei et al., 2023) for a to


This review was generated by AI and reviewed by human editors.