QUICK REVIEW

[Paper Review] Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

Mikayel Samvelyan, Sharath Chandra Raparthy|arXiv (Cornell University)|Feb 26, 2024

Topic: Advanced Malware Detection Techniques | Cited by 7

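One-sentence summary

Rainbow Teaming uses quality-diversity search (MAP-Elites) to automatically build a diverse archive of adversarial prompts for LLMs. The same prompts serve as synthetic fine-tuning data that improves safety and robustness without sacrificing general capabilities.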

ABSTRACT

As large language models (LLMs) become increasingly prevalent across many real-world applications, understanding and enhancing their robustness to adversarial attacks is of paramount importance. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. To address these limitations, we present Rainbow Teaming, a novel black-box approach for producing a diverse collection of adversarial prompts. Rainbow Teaming casts adversarial prompt generation as a quality-diversity problem and uses open-ended search to generate prompts that are both effective and diverse. Focusing on the safety domain, we use Rainbow Teaming to target various state-of-the-art LLMs, including the Llama 2 and Llama 3 models. Our approach reveals hundreds of effective adversarial prompts, with an attack success rate exceeding 90% across all tested models. Furthermore, we demonstrate that prompts generated by Rainbow Teaming are highly transferable and that fine-tuning models with synthetic data generated by our method significantly enhances their safety without sacrificing general performance or helpfulness. We additionally explore the versatility of Rainbow Teaming by applying it to question answering and cybersecurity, showcasing its potential to drive robust open-ended self-improvement in a wide range of applications.

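Research motivation and goals

Enable robust evaluation of LLM safety across diverse attack vectors and domains.
Develop a general, open-ended method for generating diverse adversarial prompts without extensive human effort.
Show that diversity improves diagnostic coverage and makes synthetic data for safety supervised fine-tuning (SFT) more effective.
Demonstrate applicability across domains (safety, question answering, cybersecurity) and transferability across model sizes.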

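Proposed method

Cast adversarial prompt generation as a quality-diversity (QD) problem and solve it with MAP-Elites.
Build a K-dimensional archive whose feature dimensions encode diversity (for example, risk category and attack style).
Use a Mutator LLM to generate candidate prompts conditioned on a given feature descriptor.
Query the target LLM with each candidate prompt, have a Judge LLM compare the responses to decide which is more unsafe, then update the archive.
Use a Judge-based preference model (pairwise comparison rather than absolute scoring) to avoid reward hacking and support open-ended improvement.
Optionally apply domain-specific mutations and evaluate with domain-appropriate evaluators (GPT-4, Llama Guard).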
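
The steps above can be sketched as a single MAP-Elites iteration. This is a minimal illustration, not the authors' implementation: the function names (rainbow_step, run_demo), the two-dimensional archive, the rule that an empty cell accepts any candidate, and all three toy callables are assumptions made for this example. In a real setup, mutate, target, and judge_prefers would each wrap an LLM call.

```python
import random
from typing import Callable, Dict, List, Tuple

Cell = Tuple[str, str]  # (risk category, attack style): the feature descriptor


def rainbow_step(
    archive: Dict[Cell, str],
    features: Tuple[List[str], List[str]],
    mutate: Callable[[str, Cell], str],
    target: Callable[[str], str],
    judge_prefers: Callable[[str, str], bool],
    seed_prompt: str,
    rng: random.Random,
) -> bool:
    """Run one iteration; return True if the archive changed."""
    # 1. Sample a parent prompt from the archive (fall back to the seed).
    parent = rng.choice(list(archive.values())) if archive else seed_prompt
    # 2. Sample the descriptor (archive cell) the new prompt should occupy.
    cell = (rng.choice(features[0]), rng.choice(features[1]))
    # 3. The mutator rewrites the parent conditioned on that descriptor.
    candidate = mutate(parent, cell)
    # 4. Empty cell: accept the candidate (an assumption of this sketch).
    incumbent = archive.get(cell)
    if incumbent is None:
        archive[cell] = candidate
        return True
    # 5. Occupied cell: the judge compares the two target responses pairwise.
    if judge_prefers(target(candidate), target(incumbent)):
        archive[cell] = candidate
        return True
    return False


def run_demo(iterations: int = 200, seed: int = 0) -> Dict[Cell, str]:
    """Drive the loop with placeholder callables (no LLM involved)."""
    rng = random.Random(seed)
    features = (["risk_a", "risk_b"], ["style_x", "style_y"])

    def toy_mutate(parent: str, cell: Cell) -> str:
        base = parent.split(" || ")[0]  # keep the seed text, replace the tag
        return f"{base} || risk={cell[0]} style={cell[1]} v{rng.randrange(1000)}"

    def toy_target(prompt: str) -> str:
        return f"response to: {prompt}"

    def toy_judge(candidate_resp: str, incumbent_resp: str) -> bool:
        # Placeholder ordering only; a real judge would compare harmfulness.
        return candidate_resp > incumbent_resp

    archive: Dict[Cell, str] = {}
    for _ in range(iterations):
        rainbow_step(archive, features, toy_mutate, toy_target,
                     toy_judge, "seed prompt", rng)
    return archive


if __name__ == "__main__":
    for cell, prompt in sorted(run_demo().items()):
        print(cell, "->", prompt)
```

Because the judge only answers "is the candidate's response worse than the incumbent's", the archive never depends on an absolute score, which is the property the summary credits with avoiding reward hacking.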

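Experimental results

Research questions

RQ1: Can open-ended QD search produce a broad, high-quality archive of adversarial prompts in the safety, question answering, and cybersecurity domains?
RQ2: Do adversarial prompts discovered for one model or domain transfer to other models or domains?
RQ3: Does adding a similarity filter preserve prompt diversity without sacrificing effectiveness?
RQ4: Do the system prompt and the preference model significantly affect robustness results and evaluation bias?
RQ5: Does fine-tuning on synthetic data generated by Rainbow Teaming significantly improve safety and robustness?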

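Main findings

Within 2,000 iterations, the method discovers hundreds of adversarial prompts per domain/model, giving broader coverage when diagnosing vulnerabilities.
In the safety experiments on Llama 2-chat variants, attack success rate (ASR, as judged by GPT-4) depends on model size: about 92% for 7B, 84% for 13B, and about 87% for 70B.
Prompts transfer well across model sizes: prompts generated against 7B achieve 46% and 53% ASR when applied to the 13B and 70B targets, respectively.
Adding a similarity filter at the mutation stage preserves diversity, lowering Self-BLEU from 0.90 to 0.39 while keeping ASR high (GPT-4 0.92, Llama Guard 0.89).
Judge-based preference comparison avoids reward hacking and agrees better with GPT-4 ASR than score-based approaches.
Fine-tuning on synthetic data produced by Rainbow Teaming sharply reduces ASR (7B: GPT-4 from 0.92 to 0.026; Llama Guard from 0.82 to 0.013) with no negative impact on GSM8K or MMLU; further rounds of adversarial training improve robustness.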
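
The similarity filter and the Self-BLEU figure in the findings above both rest on n-gram overlap between prompts. The sketch below illustrates the idea with clipped bigram precision, a simplified stand-in for BLEU (no brevity penalty, a single n-gram order). The function names, the 0.6 threshold, and the averaging rule in mean_self_similarity are assumptions for this example, not values or definitions taken from this summary.

```python
from collections import Counter
from typing import List


def _ngrams(text: str, n: int) -> Counter:
    """Count whitespace-token n-grams, case-insensitively."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of candidate n-grams that also occur in the reference (clipped)."""
    cand, ref = _ngrams(candidate, n), _ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0  # candidate too short to contain any n-gram
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def passes_similarity_filter(candidate: str, parent: str,
                             threshold: float = 0.6, n: int = 2) -> bool:
    """Keep a mutation only if it differs enough from its parent.

    The threshold is a hypothetical value chosen for illustration.
    """
    return ngram_precision(candidate, parent, n) < threshold


def mean_self_similarity(prompts: List[str], n: int = 2) -> float:
    """Rough Self-BLEU proxy: each prompt's highest overlap with any other, averaged."""
    if len(prompts) < 2:
        return 0.0
    scores = []
    for i, prompt in enumerate(prompts):
        others = prompts[:i] + prompts[i + 1:]
        scores.append(max(ngram_precision(prompt, other, n) for other in others))
    return sum(scores) / len(scores)
```

Applied at the mutation stage, a check like passes_similarity_filter rejects near-copies of the parent before they reach the target model, which is how a filter can lower archive-level self-similarity without touching the judging step.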


This review was generated by AI and reviewed by a human editor.