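[Paper Digest] Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models
Under natural assumptions, this paper proves that strong watermarking is infeasible for generative models. It also presents a generic, quality-preserving attack that removes the watermarks of three LLM schemes with almost no effect on output quality.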
Watermarking generative models consists of planting a statistical signal (watermark) in a model's output so that it can be later verified that the output was generated by the given model. A strong watermarking scheme satisfies the property that a computationally bounded attacker cannot erase the watermark without causing significant quality degradation. In this paper, we study the (im)possibility of strong watermarking schemes. We prove that, under well-specified and natural assumptions, strong watermarking is impossible to achieve. This holds even in the private detection algorithm setting, where the watermark insertion and detection algorithms share a secret key, unknown to the attacker. To prove this result, we introduce a generic efficient watermark attack; the attacker is not required to know the private key of the scheme or even which scheme is used. Our attack is based on two assumptions: (1) The attacker has access to a "quality oracle" that can evaluate whether a candidate output is a high-quality response to a prompt, and (2) The attacker has access to a "perturbation oracle" which can modify an output with a nontrivial probability of maintaining quality, and which induces an efficiently mixing random walk on high-quality outputs. We argue that both assumptions can be satisfied in practice by an attacker with weaker computational capabilities than the watermarked model itself, to which the attacker has only black-box access. Furthermore, our assumptions will likely only be easier to satisfy over time as models grow in capabilities and modalities. We demonstrate the feasibility of our attack by instantiating it to attack three existing watermarking schemes for large language models: Kirchenbauer et al. (2023), Kuditipudi et al. (2023), and Zhao et al. (2023). The same attack successfully removes the watermarks planted by all three schemes, with only minor quality degradation.
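Research Motivation and Goals
- Motivation: robust watermarks are needed to distinguish model outputs from human-written text and so deter misuse.
- Formally define secret-key strong watermarking for generative models.
- Give an impossibility result showing that such watermarks cannot be robust under realistic assumptions.
- Propose and implement a generic attack, based on a quality oracle and a perturbation oracle, that removes watermarks.
- Demonstrate the attack empirically on three existing LLM watermarking schemes.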
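Proposed Method
- Formalize generative models and a general quality function Q over prompt-response pairs.
- Define secret-key watermarking as a pair of procedures, Watermark and Detect, and quantify its false-positive and false-negative rates.
- Introduce a generic, efficient attack based on a quality-preserving random walk that uses a perturbation oracle and a quality oracle (Algorithm 1).
- State the main result informally (Theorem 1): with high probability, the attacker removes the watermark while preserving quality. The formal statement is in Appendix B.
- Implement and test the attack on Llama2-7B against the schemes of Kirchenbauer et al. (2023a), Kuditipudi et al. (2023), and Zhao et al. (2023a).
- Provide experimental evidence that the watermark detection probability drops with minimal quality loss.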
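The random-walk attack described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1 verbatim: the binary quality check, the function names, and the toy synonym-swap oracles are all assumptions made for the example, whereas a realistic attacker would back both oracles with language models.

```python
import random

def random_walk_attack(prompt, response, perturb, is_high_quality, steps, rng):
    """Propose a perturbation at each step; keep it only if the quality oracle accepts."""
    current = response
    for _ in range(steps):
        candidate = perturb(prompt, current, rng)
        if is_high_quality(prompt, candidate):
            current = candidate  # accepted: the walk moves to the new output
        # rejected candidates are discarded and the walk stays in place
    return current

# --- Toy oracles (hypothetical, for illustration only) ---
SYNONYMS = {
    "quick": ["quick", "fast", "rapid"],
    "big": ["big", "large", "huge"],
    "happy": ["happy", "glad", "joyful"],
}
# Map every known word to the key of its synonym group.
GROUP = {word: key for key, words in SYNONYMS.items() for word in words}
JUNK = ["zzz", "qqq"]  # stand-ins for quality-destroying edits

def toy_perturb(prompt, text, rng):
    """Replace one random word with a synonym or, sometimes, a junk token."""
    words = text.split()
    i = rng.randrange(len(words))
    pool = SYNONYMS.get(GROUP.get(words[i], ""), [words[i]]) + JUNK
    words[i] = rng.choice(pool)
    return " ".join(words)

def toy_quality(prompt, text):
    """Accept a candidate only if it contains no junk tokens."""
    return not any(word in JUNK for word in text.split())

original = "the quick big happy dog"
attacked = random_walk_attack(
    "describe the dog", original, toy_perturb, toy_quality,
    steps=200, rng=random.Random(0),
)
print(attacked)
```

The attacker never touches the secret key or the detector: it only needs the two oracles, which is why the same loop applies regardless of which watermarking scheme produced the response.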
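Experimental Results
Research Questions
- RQ1: Under natural assumptions about generative models, is strong watermarking achievable?
- RQ2: Does an efficient attacker exist that can erase the watermark while preserving output quality?
- RQ3: Does a secret-key watermarking scheme remain secure when the attacker has only black-box access to the watermarked model?
- RQ4: How do the attack and its implications generalize across watermarking schemes and modalities?
Key Findings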
| Framework | z-score on C4 Real News (before → after attack) | p-value on C4 Real News (before → after attack) | GPT-4 Judge |
|---|---|---|---|
| UMD [27] | 6.236 → 1.628 | 0.002 → 0.187 | -0.0877 |
| Unigram [67] | 8.210 → 1.456 | 4.563e-11 → 0.208 | -0.0812 |
| EXP [31] | 3.540 → 0.745 | < 1/5000 → 0.3119 | -0.0675 |
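- Given a quality oracle and a perturbation oracle, there is an efficient attacker that removes the watermark with high probability while preserving quality.
- The attack removes the watermark from three publicly available LLM watermarking schemes with only minor quality degradation.
- Experiments show that the watermark detection probability falls (the post-attack z-scores and p-values no longer indicate a significant watermark) while output quality, as judged by GPT-4 in comparisons, is maintained.
- The work provides a concrete, realistic attack together with theoretical results, stated informally as Theorem 1 and formally in Appendix B.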
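For readers unfamiliar with the z-score and p-value columns, here is a minimal sketch of how a green-list detector in the style of Kirchenbauer et al. (2023) turns a token count into those two statistics for a single text. The function names, the green fraction `gamma = 0.25`, and the token counts are illustrative assumptions, not values from the paper. The other two schemes use different test statistics, and the table's figures are not reproduced by this snippet.

```python
import math

def z_score(green_count, total_tokens, gamma):
    """One-proportion z-test: how far the green-token count exceeds its expectation."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std

def p_value(z):
    """One-sided p-value under the standard normal: P(Z >= z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical counts: a watermarked text is rich in green tokens,
# while a successfully attacked text sits close to the expected fraction.
gamma = 0.25
z_marked = z_score(green_count=100, total_tokens=200, gamma=gamma)
z_attacked = z_score(green_count=55, total_tokens=200, gamma=gamma)
print(z_marked, p_value(z_marked))
print(z_attacked, p_value(z_attacked))
```

A low z-score and a large p-value mean the detector cannot reject the hypothesis that the text is unwatermarked, which is the outcome the attack aims for.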
This digest was generated by AI and reviewed by a human editor.