QUICK REVIEW

[Paper Review] MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots

Gelei Deng, Yi Liu | arXiv (Cornell University) | Jul 16, 2023

Topic Modeling | Cited by 17

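One-Sentence Summary

MASTERKEY presents an end-to-end framework for studying jailbreak defenses in mainstream LLM chatbots. It combines time-based reverse engineering of those defenses with automated jailbreak prompt generation, and it reports notable jailbreak success on GPT-3.5/4, Bard, Bing Chat, and Ernie.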

ABSTRACT

Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI) services due to their exceptional proficiency in understanding and generating human-like text. LLM chatbots, in particular, have seen widespread adoption, transforming human-machine interactions. However, these LLM chatbots are susceptible to "jailbreak" attacks, where malicious users manipulate prompts to elicit inappropriate or sensitive responses, contravening service policies. Despite existing attempts to mitigate such threats, our research reveals a substantial gap in our understanding of these vulnerabilities, largely due to the undisclosed defensive measures implemented by LLM service providers. In this paper, we present Jailbreaker, a comprehensive framework that offers an in-depth understanding of jailbreak attacks and countermeasures. Our work makes a dual contribution. First, we propose an innovative methodology inspired by time-based SQL injection techniques to reverse-engineer the defensive strategies of prominent LLM chatbots, such as ChatGPT, Bard, and Bing Chat. This time-sensitive approach uncovers intricate details about these services' defenses, facilitating a proof-of-concept attack that successfully bypasses their mechanisms. Second, we introduce an automatic generation method for jailbreak prompts. Leveraging a fine-tuned LLM, we validate the potential of automated jailbreak generation across various commercial LLM chatbots. Our method achieves a promising average success rate of 21.58%, significantly outperforming the effectiveness of existing techniques. We have responsibly disclosed our findings to the concerned service providers, underscoring the urgent need for more robust defenses. Jailbreaker thus marks a significant step towards understanding and mitigating jailbreak threats in the realm of LLM chatbots.

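Motivation and Objectives

Understand why mainstream LLM chatbots other than OpenAI's ChatGPT resist jailbreak attempts.
Reverse-engineer undisclosed defense mechanisms, by analogy with time-based testing, to infer the defense strategies in use.
Develop an automated method for generating universal jailbreak prompts across multiple LLM chatbots.
Demonstrate that jailbreaks generalize across chatbots, and identify areas that need stronger defenses.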

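Proposed Method

Use response generation time as a proxy for inferring the internal defense mechanisms of LLM chatbots.
Apply a time-based blind testing approach, inspired by SQL injection, to the defenses of Bard and Bing Chat.
Build a three-stage RLHF-based pipeline to train an LLM that automatically generates jailbreak prompts (dataset construction, continued pre-training and task tuning, reward-ranked fine-tuning).
Evaluate jailbreak prompts on GPT-3.5, GPT-4, Bard, Bing Chat, and Ernie using 850 generated prompts.
Quantify the query success rate and the prompt success rate to assess jailbreak effectiveness on each model.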
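
The two metrics named above can be illustrated with a short calculation. This is a minimal sketch, not the authors' evaluation code. It assumes that the query success rate is the share of individual queries that succeed, and that the prompt success rate is the share of distinct prompts with at least one successful query. The `QueryResult` type and the sample records are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryResult:
    """One evaluation query (hypothetical record layout)."""
    prompt_id: str
    succeeded: bool


def query_success_rate(results):
    """Fraction of individual queries marked as successful."""
    if not results:
        return 0.0
    return sum(r.succeeded for r in results) / len(results)


def prompt_success_rate(results):
    """Fraction of distinct prompts with at least one successful query."""
    prompt_ids = {r.prompt_id for r in results}
    if not prompt_ids:
        return 0.0
    successful_ids = {r.prompt_id for r in results if r.succeeded}
    return len(successful_ids) / len(prompt_ids)


# Illustrative records only; these are not results from the paper.
results = [
    QueryResult("p1", True),
    QueryResult("p1", False),
    QueryResult("p2", False),
    QueryResult("p2", False),
    QueryResult("p3", True),
    QueryResult("p3", True),
]

print(f"query success rate:  {query_success_rate(results):.2%}")
print(f"prompt success rate: {prompt_success_rate(results):.2%}")
```

Under these assumed definitions the two rates can differ, because a prompt counts as successful once any of its queries succeeds, no matter how many of them fail.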

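Experimental Results

Research Questions

RQ1: What usage policies do LLM chatbot service providers set?
RQ2: How effective are existing jailbreak prompts against commercial LLM chatbots?
RQ3: How do the undisclosed defenses in mainstream LLM chatbots operate?
RQ4: Can an automated system generate jailbreak prompts that generalize across models?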

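Key Findings

Existing jailbreak prompts are mostly effective on ChatGPT but have limited success on Bard and Bing Chat.
The OpenAI models (GPT-3.5 and GPT-4) show higher jailbreak success rates with existing prompts, averaging 21.12% across categories.
Bard and Bing Chat show significantly lower success rates with existing prompts (averaging 0.40% and 0.63% across patterns, respectively).
The study records successful jailbreaks on Bard (14.51% query success) and Bing Chat (13.63% query success).
The time-based testing suggests that Bard and Bing Chat may check the output during generation rather than the input prompt, which implies dynamic content moderation.
The automated jailbreak generator achieves a query success rate of 21.58% and a prompt success rate of 26.05% on the evaluated models.
The framework demonstrates that jailbreaks generalize across multiple LLM chatbots and prompts, highlighting vulnerabilities and underscoring the need for stronger defenses.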
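
The time-based finding above depends on relating response latency to response length. The sketch below shows only that arithmetic, a least-squares line fit, and is not the authors' procedure. It assumes that latency grows roughly linearly with the number of generated tokens, which this summary does not state explicitly. The `fit_line` helper and all measurements are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up measurements for illustration; not data from the paper.
tokens = [50, 100, 150, 200]        # generated tokens per response
latency_s = [1.5, 2.5, 3.5, 4.5]    # observed response time in seconds

slope, intercept = fit_line(tokens, latency_s)
print(f"seconds per token: {slope:.4f}")
print(f"fixed overhead:    {intercept:.2f} s")
```

If the linear assumption holds, a response that arrives much sooner than the fitted line predicts for its expected length would be consistent with generation stopping early, which is the kind of signal an output-side check would produce.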

This review was generated by AI and reviewed by a human editor.