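[Paper Review] Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models
Finetuning on author-specific content substantially activates verbatim recall of copyrighted books in frontier LLMs, making memorized text by the same author extractable and raising safety and copyright concerns. The effect persists across multiple models and is driven by pretraining data overlap rather than by the task format.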
Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text. This extraction generalizes across authors: finetuning exclusively on Haruki Murakami's novels unlocks verbatim recall of copyrighted books from over 30 unrelated authors. The effect is not specific to any training author or corpus: random author pairs and public-domain finetuning data produce comparable extraction, while finetuning on synthetic text yields near-zero extraction, indicating that finetuning on individual authors' works reactivates latent memorization from pretraining. Three models from different providers memorize the same books in the same regions ($r \ge 0.90$), pointing to an industry-wide vulnerability. Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures that manifest after finetuning on individual authors' works undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.
Research Motivation and Objectives
- Investigate whether finetuning on a specific author’s works activates verbatim recall of copyrighted books in frontier LLMs.
- Assess cross-author generalization and whether the effect persists with non-copyrighted or synthetic finetuning data.
- Examine whether memorization stems from pretraining data overlap or the finetuning task format.
- Explore model- and provider-wide patterns of memorization to assess industry vulnerability.
- Discuss legal and safety implications of verbatim memorization in deployed models.
Proposed Method
- Finetune GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1, and measure extraction on 81 test books from 47 authors spanning multiple genres.
- Evaluate memorization on held-out books using book memorization coverage (bmc@k) and longest verbatim spans.
- Prompt finetuned models with semantic plot summaries rather than actual book text to elicit verbatim recall from memory.
- Compare within-author and cross-author finetuning setups across three models.
- Test finetuning on public-domain Woolf works and synthetic data to assess the role of pretraining data overlap versus task format.
- Analyze cross-paragraph spans and cross-model agreement to characterize memorization patterns.
- Cross-check provenance by comparing extracted spans against large pretraining corpora and pirate-book repositories.
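
The two evaluation measures named above can be illustrated with a small sketch. This review does not give the exact definition of bmc@k, so the code below assumes a simplified reading: the fraction of a book's words that fall inside some verbatim match of at least k consecutive words in the model's generations. The function names, the whitespace tokenizer, and the matching rule are all illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: lowercased whitespace tokens; the paper may normalize differently.
    return text.lower().split()


def covered_positions(book: List[str], generation: List[str], k: int) -> Set[int]:
    """Indices of book words lying inside a verbatim match of at least k words."""
    # Index every k-gram of the book by its starting positions.
    index: Dict[Tuple[str, ...], List[int]] = {}
    for i in range(len(book) - k + 1):
        index.setdefault(tuple(book[i:i + k]), []).append(i)

    covered: Set[int] = set()
    # Any k-gram shared with the generation marks its k book positions as covered;
    # overlapping k-grams chain together into longer matched spans.
    for j in range(len(generation) - k + 1):
        for i in index.get(tuple(generation[j:j + k]), ()):
            covered.update(range(i, i + k))
    return covered


def coverage_at_k(book_text: str, generations: List[str], k: int = 5) -> float:
    """Assumed simplified coverage: share of book words covered by any generation."""
    book = tokenize(book_text)
    covered: Set[int] = set()
    for generation in generations:
        covered |= covered_positions(book, tokenize(generation), k)
    return len(covered) / len(book) if book else 0.0


def longest_verbatim_span(book_text: str, generation: str) -> int:
    """Length in words of the longest contiguous run shared by book and generation."""
    a, b = tokenize(book_text), tokenize(generation)
    best = 0
    prev = [0] * (len(b) + 1)
    # Classic longest-common-substring dynamic program over word sequences.
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

Under this reading, a larger k makes the measure stricter, since short coincidental overlaps no longer count. The quadratic span search is adequate for paragraph-sized inputs; a full book would need a suffix-based index instead.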

Experimental Results
Research Questions
- RQ1: Can finetuning on an author’s works trigger verbatim extraction from held-out books by the same author?
- RQ2: Does finetuning on one author enable extraction of memorized copyrighted content from unrelated authors (cross-author generalization)?
- RQ3: Is the observed extraction driven by pretraining data overlap rather than the finetuning task format?
- RQ4: Do different model providers memorize substantially similar content, exposing an industry-wide vulnerability?
- RQ5: What are the legal and safety implications of verbatim memorization in deployed models?
Key Results
- Without finetuning, aligned instruction-tuned models show minimal verbatim extraction (average bmc@5 of about 7.36%).
- Finetuning within-author dramatically increases memorization across GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1, with multiple books surpassing 40% bmc@5.
- Cross-author finetuning (e.g., Murakami training) enables substantial extraction from unseen authors, with per-book correlations r≥0.92 across conditions.
- Finetuning on public-domain Virginia Woolf yields extraction comparable to copyrighted cross-author conditions, while synthetic data yields near-zero extraction, indicating pretraining-data overlap as the driver.
- Across models, memorization patterns are highly concordant: per-book extraction rates are strongly correlated (r≥0.90), and word-level Jaccard similarity between models reaches 90–97% of the self-agreement ceiling.
- These results suggest that frontier models store copies of books in their weights and that current safety alignment does not prevent large-scale verbatim regurgitation after targeted finetuning.
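
The cross-model agreement figures above rest on two standard statistics, sketched here for concreteness. The sketch assumes that each model's extraction is represented as a set of covered book-word positions, and that the "self-agreement ceiling" is the Jaccard similarity between two runs of the same model; both are assumptions, since this review does not spell out the procedure. All function names are hypothetical.

```python
from math import sqrt
from typing import Sequence, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Word-level Jaccard similarity between two sets of extracted word positions."""
    union = a | b
    # Two empty sets are treated as identical by convention.
    return len(a & b) / len(union) if union else 1.0


def relative_agreement(cross_model: float, self_ceiling: float) -> float:
    """Cross-model Jaccard expressed as a fraction of the assumed self-agreement ceiling."""
    if self_ceiling <= 0:
        raise ValueError("self-agreement ceiling must be positive")
    return cross_model / self_ceiling


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between per-book extraction rates of two models."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Normalizing by a same-model ceiling matters because sampling noise keeps even two runs of one model from agreeing perfectly, so a raw Jaccard score would understate how closely two models overlap.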

This review was generated by AI and reviewed by a human editor.