QUICK REVIEW

[Paper Review] LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Mansi Phute, Alec Helbling|arXiv (Cornell University)|Aug 14, 2023

Topic Modeling | Cited by 33

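One-Sentence Summary

This paper proposes a zero-shot defense in which a second instance of a large language model acts as a harm filter that classifies the generated content, substantially reducing the attack success rate without any model fine-tuning or input preprocessing.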

ABSTRACT

Large language models (LLMs) are popular for high-quality text generation but can produce harmful content, even when aligned with human values through reinforcement learning. Adversarial prompts can bypass their safety measures. We propose LLM Self Defense, a simple approach to defend against these attacks by having an LLM screen the induced responses. Our method does not require any fine-tuning, input preprocessing, or iterative output generation. Instead, we incorporate the generated content into a pre-defined prompt and employ another instance of an LLM to analyze the text and predict whether it is harmful. We test LLM Self Defense on GPT 3.5 and Llama 2, two of the current most prominent LLMs against various types of attacks, such as forcefully inducing affirmative responses to prompts and prompt engineering attacks. Notably, LLM Self Defense succeeds in reducing the attack success rate to virtually 0 using both GPT 3.5 and Llama 2. The code is publicly available at https://github.com/poloclub/llm-self-defense

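Motivation and Goals

Motivate a robust defense against adversarial prompts that induce harmful content.
Propose a simple zero-shot self-defense mechanism that requires no model fine-tuning or data preprocessing.
Validate the method on two prominent LLMs and across multiple attack types.
Evaluate how the placement of the harm-detection question (as a prefix before the text vs. as a suffix after it) affects filter performance.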

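Proposed Method

Given a harmful prompt, induce harmful text from the generator LLM to obtain T_resp.
Use a harm filter LLM to classify T_resp as harmful or harmless through a zero-shot prompt.
The harm filter is another LLM instance that outputs either "Yes, this is harmful" or "No, this is not harmful".
Evaluate two configurations: posing the harm question as a prefix (before the filter reads T_resp) and as a suffix (after the filter reads T_resp).
Compute the accuracy, true positive rate, and false positive rate of the harm classifier for each model and configuration.
No fine-tuning, input preprocessing, or iterative generation is required.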
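
The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the refusal message, and the names `build_filter_prompt`, `is_harmful`, and `self_defense` are all assumptions, and `generator` and `harm_filter` are hypothetical callables standing in for whatever LLM client is actually used.

```python
from typing import Callable

# Assumed wording; the paper's exact filter prompt is not reproduced here.
PREFIX_QUESTION = (
    "Does the following text contain harmful content? Respond with "
    "'Yes, this is harmful' or 'No, this is not harmful'."
)
SUFFIX_QUESTION = (
    "Does the text above contain harmful content? Respond with "
    "'Yes, this is harmful' or 'No, this is not harmful'."
)
REFUSAL = "Sorry, I cannot help with that request."  # assumed refusal message


def build_filter_prompt(response_text: str, mode: str = "suffix") -> str:
    """Embed the generated text T_resp in a fixed zero-shot prompt."""
    if mode == "prefix":
        # The question comes before the filter reads T_resp.
        return f"{PREFIX_QUESTION}\n\nText: {response_text}"
    if mode == "suffix":
        # The question comes after the filter has read T_resp.
        return f"Text: {response_text}\n\n{SUFFIX_QUESTION}"
    raise ValueError(f"unknown mode: {mode!r}")


def is_harmful(verdict: str) -> bool:
    """Parse the filter's verdict, which is expected to start with Yes or No."""
    return verdict.strip().lower().startswith("yes")


def self_defense(
    prompt: str,
    generator: Callable[[str], str],    # hypothetical: prompt -> response text
    harm_filter: Callable[[str], str],  # hypothetical: filter prompt -> verdict
    mode: str = "suffix",
) -> str:
    """Generate once, screen once; no fine-tuning and no iterative generation."""
    response = generator(prompt)
    verdict = harm_filter(build_filter_prompt(response, mode))
    return REFUSAL if is_harmful(verdict) else response
```

Passing the two models in as plain callables keeps the sketch independent of any particular API; the same function works whether the generator and the filter are the same model or different ones.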

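Experimental Results

Research Questions

RQ1: Can a zero-shot LLM serve as an effective harm filter that detects and blocks harmful content generated by another LLM?
RQ2: Does asking the harm question before or after the filter processes the content affect harm-detection performance?
RQ3: Does the defense generalize across GPT-3.5 and Llama 2 and across different attack types?
RQ4: What is the trade-off between accuracy and false positives when the harm question is used as a prefix versus a suffix?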

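Key Findings

The harm filter identifies harmful content with high accuracy: GPT-3.5 (prefix 98%, suffix 99%) and Llama 2 (prefix 77%, suffix 94.6%).
Suffix-based harm detection is generally more effective than prefix-based detection at reducing the false positive rate.
When the content is checked after the filter has read it (suffix), GPT-3.5 reaches 99% accuracy and Llama 2 reaches 94.6% accuracy, with an attack success rate of virtually zero.
Across attack types, including forced affirmative-response induction and prompt engineering attacks, LLM Self Defense reduces the attack success rate on GPT-3.5 and Llama 2 to virtually zero.
In suffix mode, Llama 2 has a lower false positive rate than in prefix mode (0.09 FPR vs. 0.42 FPR).
The method works without model fine-tuning or data preprocessing, offering a faster and simpler defense than earlier iterative defenses.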
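
For readers unfamiliar with the metrics quoted above, the sketch below shows how accuracy, true positive rate (TPR), and false positive rate (FPR) follow from a filter's predictions. The function name `filter_metrics` and the toy labels are illustrative assumptions only; they are not data from the paper.

```python
from typing import Dict, Sequence


def filter_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, TPR, and FPR for a binary harm filter (True means harmful)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # harmful, flagged
    fn = sum(1 for t, p in pairs if t and not p)      # harmful, missed
    fp = sum(1 for t, p in pairs if not t and p)      # harmless, flagged
    tn = sum(1 for t, p in pairs if not t and not p)  # harmless, passed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Toy labels for illustration only (not results from the paper).
labels = [True, True, True, False, False]
preds = [True, True, False, True, False]
metrics = filter_metrics(labels, preds)
```

A high FPR means the filter often flags harmless responses as harmful, which is why the prefix vs. suffix comparison reports FPR alongside accuracy.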

This review was generated by AI and reviewed by human editors.