QUICK REVIEW

[Paper Review] Red-Teaming the Stable Diffusion Safety Filter

Javier Rando, Daniel Paleka|arXiv (Cornell University)|Oct 3, 2022

Generative Adversarial Networks and Image Synthesis | Cited by 25

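One-Sentence Summary

This paper reverse-engineers the Stable Diffusion safety filter, shows that it mainly blocks sexual content while ignoring violence, and argues that safety measures should be open and well documented.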

ABSTRACT

Stable Diffusion is a recent open-source image generation model comparable to proprietary models such as DALLE, Imagen, or Parti. Stable Diffusion comes with a safety filter that aims to prevent generating explicit images. Unfortunately, the filter is obfuscated and poorly documented. This makes it hard for users to prevent misuse in their applications, and to understand the filter's limitations and improve it. We first show that it is easy to generate disturbing content that bypasses the safety filter. We then reverse-engineer the filter and find that while it aims to prevent sexual content, it ignores violence, gore, and other similarly disturbing content. Based on our analysis, we argue safety measures in future model releases should strive to be fully open and properly documented to stimulate security contributions from the community.

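Motivation and Objectives

Raise safety concerns about publicly released ML models and make the case for robust, transparent safety features.
Show that the Stable Diffusion safety filter is obfuscated and easy to bypass.
Identify the limitations of the current safety mechanism, in particular its focus on sexual content rather than violent or gory content.
Propose best practices for open documentation and vulnerability disclosure in ML safety features.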

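Proposed Method

Infer the safety filter's workflow from the public code by tracing its CLIP-based embedding comparisons.
Describe the 17 unsafe concepts and 3 special-care concepts the filter uses, and how their thresholds work.
Show that a prompt dilution strategy can bypass the filter even without knowing the explicit concepts it checks for.
Use a dictionary attack to recover the text prompts behind the obfuscated concept embeddings.
Show that the filter is biased toward sexual content and ignores violence, gore, and other non-sexual risks.
Advocate open safety documentation and vulnerability disclosure practices.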
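
The threshold logic described in the list above can be illustrated with a minimal sketch. It is not the filter's actual code: the 2-D vectors stand in for CLIP embeddings, and the function name `is_flagged`, the threshold values, and the `adjustment` size are assumptions chosen only to show how a special-care match can tighten the thresholds of the main concepts.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def is_flagged(image_emb, concepts, special_care, adjustment=0.01):
    """Return True if the image embedding matches any unsafe concept.

    concepts, special_care: lists of (embedding, threshold) pairs.
    adjustment: assumed amount by which a special-care match lowers
    every main-concept threshold (hypothetical value).
    """
    # Tier 1: a special-care match makes the second tier stricter.
    shift = 0.0
    for emb, threshold in special_care:
        if cosine(image_emb, emb) > threshold:
            shift = adjustment
            break
    # Tier 2: compare against each unsafe concept with the shifted threshold.
    return any(cosine(image_emb, emb) > threshold - shift
               for emb, threshold in concepts)

# Toy data: one unsafe concept and one special-care concept.
concepts = [((1.0, 0.0), 0.90)]
special_care = [((0.0, 1.0), 0.40)]
borderline = (0.9, 0.45)  # similarity to the unsafe concept is about 0.894

print(is_flagged(borderline, concepts, special_care))  # special care active
print(is_flagged(borderline, concepts, []))            # special care absent
```

In this toy setup the borderline embedding is flagged only when the special-care concept also matches, which is the two-tier behavior the summary describes.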

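Experimental Results

Research Questions

RQ1: Can Stable Diffusion's safety filter reliably detect and block explicit sexual content?
RQ2: Does the filter have systematic blind spots, such as violent or gory content that is not blocked?
RQ3: Can the hidden safety concepts be recovered or reverse-engineered to learn the filter's true coverage?
RQ4: Which governance and safety practices best support safer open releases of ML models?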

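Key Findings

Prompt dilution alone is enough to bypass the safety filter and generate explicit content.
The filter focuses on sexual content and ignores violence, gore, and other disturbing content.
A simple dictionary attack recovers most of the 17 unsafe concepts, even though they are distributed only as obfuscated embeddings.
The filter has a two-tier mechanism in which special-care concepts lower the thresholds of the main concepts; this mechanism is vulnerable and undocumented.
Prompt engineering and biased associations in CLIP's latent space can cause both false positives and false negatives, including safe-for-work content being wrongly flagged as unsafe.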
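
The dictionary attack mentioned in the findings above can be sketched as a nearest-neighbor search over candidate words. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical letter-count encoder standing in for the CLIP text encoder, and the vocabulary and hidden word are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def toy_embed(word):
    """Hypothetical stand-in for a text encoder: a 26-dim letter-count vector."""
    vec = [0.0] * 26
    for ch in word.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def dictionary_attack(target_emb, vocabulary, embed=toy_embed):
    """Return the (word, similarity) pair whose embedding is closest to target_emb."""
    scored = ((word, cosine(embed(word), target_emb)) for word in vocabulary)
    return max(scored, key=lambda pair: pair[1])

# The "obfuscated" concept is published only as an embedding.
hidden_embedding = toy_embed("stone")
vocabulary = ["apple", "river", "stone", "cloud"]

word, similarity = dictionary_attack(hidden_embedding, vocabulary)
print(word, round(similarity, 3))
```

Because the encoder is public, anyone can embed a large vocabulary and compare each candidate against the published embedding, so shipping embeddings instead of text hides little.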


This review was generated by AI and reviewed by a human editor.