QUICK REVIEW

[Paper Review] The Files are in the Computer: On Copyright, Memorization, and Generative AI

A. Feder Cooper, James Grimmelmann|arXiv (Cornell University)|Apr 19, 2024
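Law, AI, and Intellectual Property | Cited by 5
One-Sentence Summary

This paper gives a precise definition of memorization in generative AI, argues that a model that has memorized training data is a "copy" of that data in the copyright sense, and distinguishes memorization from extraction, regurgitation, and reconstruction in order to clarify the copyright implications.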

ABSTRACT

The New York Times's copyright lawsuit against OpenAI and Microsoft alleges OpenAI's GPT models have "memorized" NYT articles. Other lawsuits make similar claims. But parties, courts, and scholars disagree on what memorization is, whether it is taking place, and what its copyright implications are. These debates are clouded by ambiguities over the nature of "memorization." We attempt to bring clarity to the conversation. We draw on the technical literature to provide a firm foundation for legal discussions, providing a precise definition of memorization: a model has "memorized" a piece of training data when (1) it is possible to reconstruct from the model (2) a near-exact copy of (3) a substantial portion of (4) that piece of training data. We distinguish memorization from "extraction" (user intentionally causes a model to generate a near-exact copy), from "regurgitation" (model generates a near-exact copy, regardless of user intentions), and from "reconstruction" (the near-exact copy can be obtained from the model by any means). Several consequences follow. (1) Not all learning is memorization. (2) Memorization occurs when a model is trained; regurgitation is a symptom not its cause. (3) A model that has memorized training data is a "copy" of that training data in the sense used by copyright. (4) A model is not like a VCR or other general-purpose copying technology; it is better at generating some types of outputs (possibly regurgitated ones) than others. (5) Memorization is not a phenomenon caused by "adversarial" users bent on extraction; it is latent in the model itself. (6) The amount of training data that a model memorizes is a consequence of choices made in training. (7) Whether or not a model that has memorized actually regurgitates depends on overall system design. In a very real sense, memorized training data is in the model--to quote Zoolander, the files are in the computer.

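Motivation and Objectives

  • Clarify what memorization means in generative AI by giving it a precise technical definition.
  • Distinguish memorization from the related phenomena of extraction, regurgitation, and reconstruction.
  • Explain the copyright implications of memorized training data and how training choices and system design affect memorization.
  • Guide legal discussion by aligning the technical definitions with legal concepts.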

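Proposed Approach

  • Define memorization precisely: a model has "memorized" a piece of training data when (1) it is possible to reconstruct from the model (2) a near-exact copy of (3) a substantial portion of (4) that piece of training data.
  • Separate memorization from extraction, regurgitation, and reconstruction using explicit criteria.
  • Argue that memorization is latent in the model itself, that how much is memorized follows from choices made in training, and that whether memorized data is actually regurgitated depends on overall system design.
  • Draw on the technical literature to give legal discussions a firm technical foundation.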
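
The four-part definition above can be illustrated with a rough sketch. This is not the paper's method: the paper gives a conceptual definition, not an algorithm. The function names, the word-level comparison, and the `min_fraction` threshold are all assumptions chosen for illustration. The sketch also takes element (1) for granted by assuming `output_text` was already obtained from the model somehow, and it uses a verbatim shared run as a stricter stand-in for "near-exact."

```python
from difflib import SequenceMatcher


def longest_shared_run(training_text: str, output_text: str) -> int:
    """Return the length, in words, of the longest verbatim word run
    that appears in both texts (hypothetical helper)."""
    a = training_text.split()
    b = output_text.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size


def looks_memorized(training_text: str, output_text: str,
                    min_fraction: float = 0.5) -> bool:
    """Crude proxy for elements (2)-(4) of the definition.

    Assumption: "substantial portion" is approximated as the shared run
    covering at least `min_fraction` of the training text's words.
    The 0.5 default is arbitrary, not a legal or technical standard.
    """
    total_words = len(training_text.split())
    if total_words == 0:
        return False
    return longest_shared_run(training_text, output_text) / total_words >= min_fraction
```

For example, with the training text "the quick brown fox jumps over the lazy dog", an output that reproduces "the quick brown fox jumps over" shares a six-word run (two thirds of the text) and passes the default threshold. What counts as "substantial" in a copyright dispute is a legal question that no fixed fraction settles.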

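Results

Research Questions

  • RQ1: In the context of a generative AI model's training data, what counts as memorization?
  • RQ2: How do memorization, extraction, regurgitation, and reconstruction differ, both conceptually and in practice?
  • RQ3: If a model has memorized training data, what are the copyright implications?
  • RQ4: To what extent do training practices and system design affect whether memorization occurs and whether memorized data becomes visible?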

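Key Findings

  • Memorization is not the same as learning: not all learning is memorization, and memorization is not necessarily the only outcome of training.
  • Memorization occurs when a model is trained and is latent in the model itself; it is not caused by "adversarial" users, and regurgitation is a symptom of memorization rather than its cause.
  • Under the paper's framework, a model that has memorized training data is a "copy" of that data in the sense used by copyright.
  • Whether a model that has memorized actually regurgitates depends on overall system design, not on memorization alone.
  • The amount of training data a model memorizes is a consequence of choices made in training, including the selection of training data and the training process.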
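
The three related terms nest inside one another, which a short sketch can make explicit. It follows the parenthetical definitions in the abstract; the function name and boolean parameters are hypothetical and exist only for illustration.

```python
def applicable_terms(near_exact_copy: bool,
                     generated_by_model: bool,
                     user_intended: bool) -> set:
    """Return which of the abstract's terms describe a given event.

    reconstruction: a near-exact copy is obtained from the model by any means.
    regurgitation:  the model generates the near-exact copy,
                    regardless of user intentions.
    extraction:     the user intentionally causes that generation.
    """
    if not near_exact_copy:
        return set()
    terms = {"reconstruction"}
    if generated_by_model:
        terms.add("regurgitation")
        if user_intended:
            terms.add("extraction")
    return terms
```

Under this reading, every extraction is also a regurgitation, and every regurgitation is also a reconstruction. Memorization itself is a property of the trained model, so it is the precondition for all three rather than one of the returned labels.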


This review was generated by AI and reviewed by human editors.