QUICK REVIEW

[Paper Review] SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Alexander Robey, Eric Wong|arXiv (Cornell University)|Oct 5, 2023

Topic Modeling | Cited by 28

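One-Sentence Summary

SmoothLLM is a randomized defense wrapper that perturbs copies of the input prompt and aggregates the LLM's outputs to mitigate adversarial jailbreaks. It reduces the GCG attack success rate to below 1% on multiple models, is query-efficient, and comes with provable guarantees.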

ABSTRACT

Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at https://github.com/arobey1/smooth-llm.

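Motivation and Objectives

Lay out a comprehensive set of desiderata for defenses against adversarial-prompting jailbreaks: attack mitigation, non-conservatism, efficiency, and compatibility.
Propose SmoothLLM as the first general-purpose defense against adversarial-prompting jailbreaks.
Provide theoretical guarantees on attack mitigation under a perturbation-stability assumption.
Empirically evaluate SmoothLLM across several popular LLMs and attacks, and compare its query efficiency with that of baseline attacks.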

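Proposed Method

Identify that adversarial suffixes are brittle to character-level perturbations.
Introduce a perturbation step that creates N perturbed copies of the input prompt using insert, swap, or patch modifications, with the amount of perturbation controlled by a percentage q.
Introduce an aggregation step that passes the perturbed prompts through the LLM, uses a majority vote to decide whether the prompt jailbreaks the model, and returns a response from the perturbed runs that is consistent with that vote.
Give a formal definition of SmoothLLM and analyze its defense success probability (DSP) under the k-unstable suffix assumption.
Derive a closed-form expression for the DSP under swap perturbations, and discuss how N (the number of samples) and q (the perturbation percentage) affect robustness.
Evaluate robustness and efficiency against the GCG jailbreak, and discuss compatibility with closed-source LLMs.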
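
The perturbation and aggregation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it covers only swap perturbations, and the names `perturb_swap`, `is_jailbroken`, and `smooth_llm`, the `ALPHABET` choice, the refusal-phrase judge, and the tie-breaking rule are all hypothetical. The `llm` argument is assumed to be any callable that maps a prompt string to a response string.

```python
import random
import string

# Assumption: perturbed characters are drawn from Python's printable ASCII set.
ALPHABET = string.printable


def perturb_swap(prompt, q, rng):
    """Replace q% of the characters (rounded down) with random characters."""
    chars = list(prompt)
    num_to_swap = int(len(chars) * q / 100)
    for i in rng.sample(range(len(chars)), num_to_swap):
        chars[i] = rng.choice(ALPHABET)
    return "".join(chars)


def is_jailbroken(response):
    """Hypothetical judge: a response with no refusal phrase counts as jailbroken."""
    refusals = ("I'm sorry", "I cannot", "I can't")
    return not any(phrase in response for phrase in refusals)


def smooth_llm(prompt, llm, n=10, q=10, seed=0):
    """Query `llm` on n perturbed copies and return a response that agrees with the majority vote."""
    rng = random.Random(seed)
    responses = [llm(perturb_swap(prompt, q, rng)) for _ in range(n)]
    votes = [is_jailbroken(r) for r in responses]
    # Assumption: a tie counts as "not jailbroken".
    majority_jailbroken = sum(votes) > n / 2
    consistent = [r for r, v in zip(responses, votes) if v == majority_jailbroken]
    return rng.choice(consistent)
```

In this sketch the wrapper only needs query access to `llm`, which is why the same pattern can be applied to a model behind an API. A real deployment would likely need a stronger judge than substring matching.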

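Experimental Results

Research Questions

RQ1: Can SmoothLLM mitigate adversarial-prompting jailbreaks without retraining the model?
RQ2: How do the perturbation level q and the number of samples N affect attack mitigation and nominal performance?
RQ3: What theoretical guarantees does SmoothLLM have under the perturbation-stability assumption?
RQ4: Is SmoothLLM compatible with both open-source and closed-source LLMs, and is it more efficient than existing attacks?
RQ5: Does SmoothLLM extend to semantic jailbreaks such as PAIR?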

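Key Findings

SmoothLLM reduces the attack success rate (ASR) of GCG to below 1% on seven LLMs (Llama2, Vicuna, GPT-3.5, GPT-4, Claude-1, Claude-2, PaLM-2).
For Llama2 and Vicuna, this is a reduction of roughly 50x and 100x, respectively, relative to the undefended models.
SmoothLLM uses 10^5 to 10^6 times fewer queries than GCG and runs thousands of times faster.
It provides a high-probability mitigation guarantee against suffix-based attacks under the perturbation-stability (k-unstable suffix) assumption.
At small perturbation levels (q of about 5%), the defense preserves nominal performance on standard NLP benchmarks.
SmoothLLM reduces the ASR of the PAIR semantic jailbreak on Vicuna from 92% to about 50% (using swap perturbations; semantic jailbreaks are not its primary target).
The defense is architecture-agnostic and compatible with any LLM, including closed-source models, for which it reduces the ASR of transferred suffixes to below 1%.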
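
To see why a larger N strengthens the high-probability guarantee, it can help to look at the majority vote on its own. The sketch below is a simplification under loudly stated assumptions: each perturbed copy is treated as independently "not jailbroken" with a given probability `alpha`, and ties count as a successful defense. It does not reproduce the paper's closed-form expression for that per-copy probability under the k-unstable suffix assumption, and the function name `majority_defense_probability` is hypothetical.

```python
from math import comb


def majority_defense_probability(alpha, n):
    """Probability that jailbroken copies do not form a strict majority of n copies.

    Assumes each copy is independently not jailbroken with probability alpha,
    so the number of non-jailbroken copies is Binomial(n, alpha).
    """
    threshold = (n + 1) // 2  # at least ceil(n / 2) copies must hold
    return sum(
        comb(n, t) * alpha**t * (1 - alpha) ** (n - t)
        for t in range(threshold, n + 1)
    )
```

Under these assumptions, whenever a single perturbed copy resists the attack more often than not (`alpha` above 0.5), adding copies pushes the overall defense probability toward 1, at the cost of N queries per prompt.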

This review was generated by AI and reviewed by a human editor.