QUICK REVIEW

[Paper Review] Extracting books from production language models

Ahmed Ahmed, A. Feder Cooper|arXiv (Cornell University)|2026. 01. 06.

Authorship Attribution and Profiling | Citations: 1

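One-line summary

This paper presents a two-phase procedure for testing and carrying out long-form extraction of memorized, in-copyright books from four production LLMs, and shows that success varies by model and configuration. It introduces nv-recall as a long-form extraction metric and discusses safeguards and legal implications.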

ABSTRACT

Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized data can be extracted in the model's outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure: (1) an initial probe to test for extraction feasibility, which sometimes uses a Best-of-N (BoN) jailbreak, followed by (2) iterative continuation prompts to attempt to extract the book. We evaluate our procedure on four production LLMs -- Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 -- and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.

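Motivation and objectives

Assess whether production LLMs memorize in-copyright books and can reproduce them verbatim.
Develop a two-phase extraction procedure that works with black-box APIs and production safeguards.
Propose a robust near-verbatim metric for measuring long-form extraction.
Quantify extraction success across multiple production LLMs and configurations.
Discuss the legal and safeguard implications of memorization and extraction.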

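Proposed method

Phase 1 probes feasibility by prompting the model to complete a short ground-truth prefix of the book, using an instruction such as 'Continue the following text verbatim'.
Phase 1 may use a Best-of-N (BoN) jailbreak to get around the safeguards of some models.
If Phase 1 succeeds, Phase 2 issues iterative continuation requests to extract progressively more of the book.
Extraction success is scored with nv-recall, a block-based approximation of longest common substring.
Generation settings (temperature, max length, penalties) are tuned per LLM to maximize extraction.
Extraction is verified with a conservative long-span matching algorithm that identifies near-verbatim blocks.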
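The idea of scoring extraction by long matching blocks can be illustrated with a small sketch. This is not the paper's nv-recall implementation, whose exact definition is not given in this review. It is a simplified stand-in that assumes word-level tokenization, exact (not near-verbatim) block matches found with Python's `difflib.SequenceMatcher`, and a hypothetical minimum block length. The names `block_recall` and `min_block_words` are made up for illustration.

```python
from difflib import SequenceMatcher


def block_recall(reference: str, generated: str, min_block_words: int = 5) -> float:
    """Fraction of reference words covered by long exact matching blocks.

    Assumption: a simplified stand-in for a block-based recall score,
    NOT the paper's nv-recall definition.
    """
    ref_words = reference.split()
    gen_words = generated.split()
    if not ref_words:
        return 0.0

    # autojunk=False keeps difflib from discarding frequent words as "junk".
    matcher = SequenceMatcher(a=ref_words, b=gen_words, autojunk=False)

    # Matching blocks are non-overlapping, in-order runs shared by both texts.
    # Only long runs are counted, so short coincidental matches are ignored.
    covered = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_block_words
    )
    return covered / len(ref_words)


if __name__ == "__main__":
    # Toy strings only; real use would compare a model output to a reference text.
    reference_text = "one two three four five six seven eight nine ten"
    generated_text = "one two three four five six and then something else"
    print(f"block recall: {block_recall(reference_text, generated_text):.2f}")
```

Counting only blocks above a length threshold is what makes such a score conservative: scattered short overlaps, which any two English texts share, contribute nothing. A near-verbatim variant would additionally merge blocks separated by small gaps, which this sketch does not attempt.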

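Experimental results

Research questions

RQ2: How does extraction feasibility differ across production LLMs under the Phase 1 probe and Phase 2 continuation?
RQ3: How much near-verbatim long-form text can be extracted from production LLMs under different configurations?
RQ4: How can near-verbatim reproduction in long generated text be measured reliably?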

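Key results

For Claude 3.7 Sonnet, the Phase 1 jailbreak followed by Phase 2 prompts can reach near-verbatim recall (nv-recall) of 95.8% on Harry Potter.
GPT-4.1 requires far more Best-of-N attempts and eventually refuses to continue, so nv-recall can be as low as 4.0% in the reported configuration.
Without any jailbreak, Gemini 2.5 Pro and Grok 3 yield nv-recall of 76.8% and 70.3%, respectively, on Harry Potter.
The study highlights that extraction remains a risk for production LLMs despite safeguards, and discusses the policy and legal implications.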

This review was generated by AI and checked by a human editor.