QUICK REVIEW

[论文解读] Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

Rishabh Bhardwaj, Soujanya Poria|arXiv (Cornell University)|Aug 18, 2023

Topic Modeling被引用 16

一句话总结

本文介绍 Red-Eval 用于通过 Chain of Utterances 提示对 LLM 进行红队测试，以及 Red-Instruct 通过 HarmfulQA 数据来安全对齐模型（Starling），在最小程度损失效用的前提下提升安全性。

ABSTRACT

Larger language models (LLMs) have taken the world by storm with their massive multi-tasking capabilities simply by optimizing over a next-word prediction objective. With the emergence of their properties and encoded knowledge, the risk of LLMs producing harmful outputs increases, making them unfit for scalable deployment for the public. In this work, we propose a new safety evaluation benchmark RED-EVAL that carries out red-teaming. We show that even widely deployed models are susceptible to the Chain of Utterances-based (CoU) prompting, jailbreaking closed source LLM-based systems such as GPT-4 and ChatGPT to unethically respond to more than 65% and 73% of harmful queries. We also demonstrate the consistency of the RED-EVAL across 8 open-source LLMs in generating harmful responses in more than 86% of the red-teaming attempts. Next, we propose RED-INSTRUCT--An approach for the safety alignment of LLMs. It constitutes two phases: 1) HARMFULQA data collection: Leveraging CoU prompting, we collect a dataset that consists of 1.9K harmful questions covering a wide range of topics, 9.5K safe and 7.3K harmful conversations from ChatGPT; 2) SAFE-ALIGN: We demonstrate how the conversational dataset can be used for the safety alignment of LLMs by minimizing the negative log-likelihood over helpful responses and penalizing over harmful responses by gradient accent over sample loss. Our model STARLING, a fine-tuned Vicuna-7B, is observed to be more safely aligned when evaluated on RED-EVAL and HHH benchmarks while preserving the utility of the baseline models (TruthfulQA, MMLU, and BBH).

研究动机与目标

开发一个安全评估基准（Red-Eval），通过 Chain of Utterances（CoU）测试 LLM 对有害提示的易受攻击性。
在已部署的闭源和开源 LLM 上证明基于 CoU 的越狱攻击的有效性。
提出 Red-Instruct 数据收集（HarmfulQA）和 Safe-Align 策略，以在不牺牲效用的情况下提升安全性。

提出的方法

Red-Eval 使用基于 CoU 的越狱提示，将 Red-LM 与 Base-LM 对峙，以引出对有害问题的有害回应。
在闭源模型（GPT-4、ChatGPT）和开源模型上使用 200/1,960 个有害问题并以 GPT-4/ChatGPT 作为评判，评估攻击成功率（ASR）。
通过 ChatGPT 使用 CoU 提示生成蓝色（无害）和红色（有害）对话，构建 HarmfulQA，得到 1,960 个有害问题和 9,536 个蓝对话以及 7,356 个红对话。
Red-Instruct 由两个阶段组成：HarmfulQA 数据生成和对 Vicuna-7B 到 Starling 的 Safe-Align 微调，采用两种策略（策略 A：仅蓝数据；策略 B：蓝+红数据，控制红数据影响）。
使用蓝/红数据的混合以及 ShareGPT 提供的数据来训练 Starling，以在保持效用的同时提升安全性。

Figure 1: Methodology depiction of Red-Instruct . Phase-1 construct HarmfulQA with harmful questions and corresponding harmless responses by CoU-based prompting, and harmful responses using CoU-based Red-teaming (proposed as a part of our Red-Eval safety benchmark). In phase-2, we utilize HarmfulQA

实验结果

研究问题

RQ1CoU 基于提示在打破开源和闭源 LLM 护栏方面有多有效？
RQ2是否可以利用 HarmfulQA 推导出的蓝/red 数据来使 LLM 更安全或无害，同时保持有用性？
RQ3使用仅蓝数据与蓝+红数据的安全对齐策略在模型安全性与效用上的相对影响是什么？

主要发现

Red-Eval 在 GPT-4 和 ChatGPT 上实现约 69% 的攻击成功率，在八个开源模型上超过 86% 的 ASR，并显著优于 CoT 基线。
CoU 提示对开源模型的红队测试显著优于 Standard 和 CoT 提示（约 86% ASR 对比约 47% 的 CoT）。
Red-Instruct/HarmfulQA 使 Starling（Vicuna-7B）在保留 TruthfulQA、MMLU 和 BBH 等基准的同时更安全。
探索了两种 Safe-Align 策略：策略 A（仅蓝数据）和策略 B（蓝+红数据，且对红响应有梯度导向的影响）。
Starling（蓝色）相对于 Vicuna-7B 显示出更强的安全对齐，Starling（蓝-红）由于训练动力学导致进一步的安全权衡。

Figure 2: Jailbreak prompt template used in Red-Eval . Yellow and cyan colored are placeholders for a harmful query and response (as text completion) to be generated by LLM, respectively. The rest of the text is a part of the CoU-based prompt. $<$ c $>$ denotes the clipped text.

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。