QUICK REVIEW

[Paper Explained] Undetectable Watermarks for Language Models

Miranda Christ, Sam Gunn | arXiv (Cornell University) | May 25, 2023
Cryptography and Data Security | Cited by 12
One-Sentence Summary

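This paper defines and constructs undetectable watermarks for language model outputs: the watermark can be detected only with a secret key, preserves text quality, and remains undetectable even under adaptive queries.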

ABSTRACT

Recent advances in the capabilities of large language models such as GPT-4 have spurred increasing concern about our ability to detect AI-generated text. Prior works have suggested methods of embedding watermarks in model outputs, by noticeably altering the output distribution. We ask: Is it possible to introduce a watermark without incurring any detectable change to the output distribution? To this end we introduce a cryptographically-inspired notion of undetectable watermarks for language models. That is, watermarks can be detected only with the knowledge of a secret key; without the secret key, it is computationally intractable to distinguish watermarked outputs from those of the original model. In particular, it is impossible for a user to observe any degradation in the quality of the text. Crucially, watermarks should remain undetectable even when the user is allowed to adaptively query the model with arbitrarily chosen prompts. We construct undetectable watermarks based on the existence of one-way functions, a standard assumption in cryptography.

Motivation and Objectives

  • Formalize a cryptographic notion of undetectable watermarks for language models.
  • Introduce empirical entropy as a measure of randomness in model outputs.
  • Develop undetectable watermarking schemes with strong completeness and soundness guarantees.
  • Show necessity of assumptions and discuss removability of watermarks.
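Empirical entropy, mentioned in the objectives above, can be read as the negative log-probability that the model assigned to the specific response it produced. The following is a minimal sketch under that reading. The function name `empirical_entropy`, the input format (a list of per-token probabilities), and the choice of base-2 logarithms are assumptions made for illustration, not details taken from the paper.

```python
import math
from typing import Sequence


def empirical_entropy(token_probs: Sequence[float]) -> float:
    """Return -log2 of the probability the model assigned to a whole response.

    token_probs[i] is the (hypothetical) probability the model gave to the
    i-th token it actually emitted, conditioned on the prompt and on all
    earlier tokens. The product of these values is the probability of the
    full response, so the sum of their negative logs is its empirical entropy.
    """
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("each probability must lie in (0, 1]")
    return -sum(math.log2(p) for p in token_probs)
```

A response whose tokens were all emitted with probability 1 has empirical entropy 0, so there is no randomness in which a watermark could be hidden. This is the intuition behind tying completeness to high-entropy outputs.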

Proposed Method

  • Define a watermarking scheme as three algorithms, Setup, Wat, and Detect, that share a secret key.
  • Introduce empirical entropy and substring-complete variants.
  • Construct watermarking schemes that are undetectable, sound, and complete.
  • Reduce to a binary alphabet to simplify construction and analysis.
  • Replace the random oracle with a pseudorandom function (PRF) to make the scheme practical.
  • Provide theoretical guarantees including Theorem 1 and Theorem 2.
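The Setup / Wat / Detect interface over a binary alphabet can be sketched as follows. This is a simplified toy, not the paper's construction. It assumes HMAC-SHA256 as a stand-in PRF, a hypothetical `next_bit_prob` callback that returns the model's probability that the next bit is 1, and a detection threshold of the form L + λ√L chosen for illustration. It also keys the PRF on the position index alone, so repeated calls reuse the same pseudorandom values; on its own that would not satisfy undetectability under repeated queries.

```python
import hashlib
import hmac
import math
import os
from typing import Callable, List


def setup(key_bytes: int = 32) -> bytes:
    """Setup: sample a fresh secret key."""
    return os.urandom(key_bytes)


def prf_unit(key: bytes, index: int) -> float:
    """Stand-in PRF: HMAC-SHA256 of the position, mapped into (0, 1)."""
    digest = hmac.new(key, index.to_bytes(8, "big"), hashlib.sha256).digest()
    # Keep 52 bits and add 0.5 so the result is never exactly 0 or 1.
    return ((int.from_bytes(digest[:8], "big") >> 12) + 0.5) / 2**52


def wat(key: bytes, next_bit_prob: Callable[[List[int]], float], length: int) -> List[int]:
    """Wat: emit bit 1 exactly when the keyed value u_i falls below p_i.

    If u_i were truly uniform, each bit would equal 1 with probability p_i,
    so the output distribution of the model would be unchanged.
    """
    bits: List[int] = []
    for i in range(length):
        p = next_bit_prob(bits)
        bits.append(1 if prf_unit(key, i) <= p else 0)
    return bits


def detect(key: bytes, bits: List[int], lam: float) -> bool:
    """Detect: score how well the bits agree with the keyed values.

    For text unrelated to the key, each term ln(1/v_i) behaves like an
    Exp(1) variable, so the score concentrates around len(bits).
    """
    score = 0.0
    for i, b in enumerate(bits):
        u = prf_unit(key, i)
        v = u if b == 1 else 1.0 - u
        score += math.log(1.0 / v)
    n = len(bits)
    return score > n + lam * math.sqrt(n)  # assumed threshold: L + lambda * sqrt(L)
```

With a hypothetical maximum-entropy model such as `lambda prefix: 0.5`, a call like `detect(key, wat(key, model, 1000), lam=6.0)` should flag the text with overwhelming probability, while the same text checked under an unrelated key should not be flagged.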

Results

Research Questions

  • RQ1: Can a watermark be embedded in language model outputs without detectable degradation in quality?
  • RQ2: What formal guarantees (undetectability, completeness, soundness) are achievable for such watermarks under adaptive querying?
  • RQ3: How does empirical entropy affect watermark detectability and completeness?
  • RQ4: Is it possible to implement undetectable watermarks without idealized assumptions such as random oracles?
  • RQ5: How resistant are undetectable watermarks to removal by an adversary with strong query access?

Key Findings

  • A watermarking scheme can be constructed that is undetectable, sound, and O(λ√L)-complete, where L is the output length and λ is the security parameter.
  • A strengthened scheme achieves undetectable, sound, and O(λ√L)-substring-complete guarantees.
  • Completeness requires sufficiently high empirical entropy in the model’s outputs.
  • Removing the random-oracle assumption improves practicality but weakens some guarantees (e.g., the scheme may be only weakly sound).
  • A PRF-based replacement can make the scheme practical, with related trade-offs in soundness.
  • The paper also shows that removing watermarks is possible under certain strong query-access assumptions.
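The three guarantees named in these findings can be written out roughly as follows. This is a hedged sketch of how such definitions are usually stated, not a verbatim copy of the paper's definitions. The notation is assumed for illustration: $\overline{\mathsf{Model}}$ denotes the unmodified response generator, $D$ an efficient distinguisher with oracle access, $H_e$ the empirical entropy, and $\mathrm{negl}$ a negligible function.

```latex
% Empirical entropy of a response x for a given prompt
H_e(\mathsf{Model}, \mathrm{prompt}, x)
  = -\log \Pr\big[\overline{\mathsf{Model}}(\mathrm{prompt}) = x\big]

% Undetectability: no efficient D, even with adaptive queries,
% can tell the watermarked generator from the original one
\Big|\, \Pr\big[D^{\mathsf{Model},\,\overline{\mathsf{Model}}}(1^\lambda) = 1\big]
  - \Pr_{sk \leftarrow \mathsf{Setup}(1^\lambda)}
      \big[D^{\mathsf{Model},\,\mathsf{Wat}_{sk}}(1^\lambda) = 1\big] \Big|
  \le \mathrm{negl}(\lambda)

% Soundness: any fixed text x chosen independently of the key is rarely flagged
\Pr_{sk \leftarrow \mathsf{Setup}(1^\lambda)}
  \big[\mathsf{Detect}_{sk}(x) = \mathrm{true}\big] \le \mathrm{negl}(\lambda)

% b(L)-completeness: watermarked text with enough empirical entropy is rarely missed
\Pr_{sk,\; x \leftarrow \mathsf{Wat}_{sk}(\mathrm{prompt})}
  \big[\mathsf{Detect}_{sk}(x) = \mathrm{false}
       \;\wedge\; H_e(\mathsf{Model}, \mathrm{prompt}, x) \ge b(|x|)\big]
  \le \mathrm{negl}(\lambda)
```

Under this reading, "O(λ√L)-complete" means the entropy bound $b(L)$ grows on the order of $\lambda\sqrt{L}$, so longer outputs need proportionally less entropy per token to be detected.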


This explainer was generated by AI and reviewed by human editors.