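[Paper Review] The Files are in the Computer: On Copyright, Memorization, and Generative AI
This paper defines memorization in generative AI, argues that memorized training data constitutes a copy for copyright purposes, and clarifies the copyright implications by distinguishing memorization from extraction, regurgitation, and reconstruction.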
The New York Times's copyright lawsuit against OpenAI and Microsoft alleges OpenAI's GPT models have "memorized" NYT articles. Other lawsuits make similar claims. But parties, courts, and scholars disagree on what memorization is, whether it is taking place, and what its copyright implications are. These debates are clouded by ambiguities over the nature of "memorization." We attempt to bring clarity to the conversation. We draw on the technical literature to provide a firm foundation for legal discussions, providing a precise definition of memorization: a model has "memorized" a piece of training data when (1) it is possible to reconstruct from the model (2) a near-exact copy of (3) a substantial portion of (4) that piece of training data. We distinguish memorization from "extraction" (user intentionally causes a model to generate a near-exact copy), from "regurgitation" (model generates a near-exact copy, regardless of user intentions), and from "reconstruction" (the near-exact copy can be obtained from the model by any means). Several consequences follow. (1) Not all learning is memorization. (2) Memorization occurs when a model is trained; regurgitation is a symptom not its cause. (3) A model that has memorized training data is a "copy" of that training data in the sense used by copyright. (4) A model is not like a VCR or other general-purpose copying technology; it is better at generating some types of outputs (possibly regurgitated ones) than others. (5) Memorization is not a phenomenon caused by "adversarial" users bent on extraction; it is latent in the model itself. (6) The amount of training data that a model memorizes is a consequence of choices made in training. (7) Whether or not a model that has memorized actually regurgitates depends on overall system design. In a very real sense, memorized training data is in the model--to quote Zoolander, the files are in the computer.
Research Motivation and Objectives
- Clarify the concept of memorization in generative AI using precise technical definitions.
- Differentiate memorization from related phenomena such as extraction, regurgitation, and reconstruction.
- Explain copyright implications of memorized training data and how model design affects memorization.
- Provide guidance for legal discussions by aligning technical definitions with legal concepts.
Proposed Method
- Provide a precise definition of memorization: a model has memorized a piece of training data when (1) it is possible to reconstruct from the model (2) a near-exact copy of (3) a substantial portion of (4) that piece of training data.
- Distinguish memorization from extraction, regurgitation, and reconstruction with clear criteria.
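- Argue that memorization is latent in the model itself, that the amount memorized is a consequence of choices made in training, and that whether memorized data is actually regurgitated depends on overall system design.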
- Draw on technical literature to ground legal discussions in a firm technical foundation.
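The four-part definition above can be made concrete with a small sketch. This is only an illustration, not the paper's method: it assumes that "near-exact copy" can be crudely approximated by the longest run of consecutive words shared between a training text and a model output, and that "substantial portion" can be approximated by a fixed fraction of the training text. The function names and the `min_fraction` threshold are hypothetical.

```python
from difflib import SequenceMatcher


def longest_shared_run(training_text: str, output_text: str) -> int:
    """Return the length, in words, of the longest contiguous word run
    that appears in both texts (a crude proxy for a near-exact copy)."""
    a = training_text.split()
    b = output_text.split()
    # autojunk=False disables difflib's heuristic that ignores very common items.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size


def looks_memorized(training_text: str, output_text: str,
                    min_fraction: float = 0.5) -> bool:
    """Hypothetical test: does the output reproduce a 'substantial portion'
    (assumed here to be min_fraction of the words) of the training text?"""
    total_words = len(training_text.split())
    if total_words == 0:
        return False
    shared = longest_shared_run(training_text, output_text)
    return shared / total_words >= min_fraction
```

Under the paper's framing, a check like this would only detect regurgitation in one particular output; memorization itself is a property of the model, so failing to observe a match in a given output would not show that the model has not memorized the text.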
Experimental Results
Research Questions
- RQ1: What counts as memorization in the context of training data for generative AI models?
- RQ2: How do memorization, extraction, regurgitation, and reconstruction differ conceptually and practically?
- RQ3: What are the copyright implications if a model memorizes training data?
- RQ4: To what extent do training practices and system design influence the occurrence and visibility of memorized data?
Key Findings
- Not all learning is memorization: a model can learn from training data without memorizing it.
- Memorization occurs when a model is trained and is latent in the model itself; it is not caused by adversarial users bent on extraction, and regurgitation is a symptom of memorization rather than its cause.
- A model that has memorized training data is a "copy" of that training data in the sense used by copyright, under the paper's framing.
- Whether a model that has memorized training data actually regurgitates it depends on overall system design, not on memorization alone.
- The amount of training data that a model memorizes is a consequence of choices made in training.
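The difference between memorization (a property of the model) and regurgitation (a behavior of the deployed system) can be illustrated with a toy sketch. Everything here is hypothetical and is not taken from the paper: `toy_model` stands in for a model that has memorized a text, and `make_filtered_generator` is one assumed example of a system-level design choice, an output filter that withholds responses sharing any n-gram with a list of protected texts.

```python
from typing import Callable, Iterable, Set, Tuple

WITHHELD = "[output withheld: overlaps protected text]"


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def make_filtered_generator(generate: Callable[[str], str],
                            protected_texts: Iterable[str],
                            n: int = 5) -> Callable[[str], str]:
    """Wrap a generator so that outputs sharing any n-gram with a
    protected text are withheld (a hypothetical system-level filter)."""
    blocked: Set[Tuple[str, ...]] = set()
    for text in protected_texts:
        blocked |= ngrams(text, n)

    def filtered(prompt: str) -> str:
        output = generate(prompt)
        if ngrams(output, n) & blocked:
            return WITHHELD
        return output

    return filtered


# Toy stand-in for a model that has memorized one training text.
MEMORIZED = "the files are in the computer and the computer is in the building"


def toy_model(prompt: str) -> str:
    if "files" in prompt:
        return MEMORIZED  # regurgitates the memorized text
    return "a fresh sentence with no overlap at all"


safe_system = make_filtered_generator(toy_model, [MEMORIZED])
```

In this sketch, `toy_model("where are the files")` returns the memorized text while `safe_system("where are the files")` returns the withheld notice. The wrapped system no longer regurgitates, but the underlying model is unchanged and still contains the text, which is the distinction the findings above draw.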
This review was generated by AI and checked by a human editor.