QUICK REVIEW

[Paper Review] The Statistical Signature of LLMs

Ortal Hadad, Edoardo Loru | arXiv (Cornell University) | 2026-02-20

Language and cultural evolution | Citations: 0

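One-Line Summary

This paper presents lossless compression as a model-agnostic measure of structural regularity, shows that it distinguishes LLM-generated text from human writing in controlled, mediated, and synthetic settings, and finds that the separation is scale-dependent.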

ABSTRACT

Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of communication.

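Motivation and Objectives

Demonstrate that lossless compression can quantify structural regularity in text without access to model internals.
Compare human writing and LLM-generated language across progressively more realistic settings.
Characterize how probabilistic generation reshapes textual structure and how that effect depends on scale.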

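Proposed Method

Encode the surface text as UTF-8 and compute the gzip-based compression ratio R(x) = C(x)/|x| (compressed size over original size).
Use prefix-based compression curves to measure how regularity accumulates with text length.
Generate entropy-controlled synthetic text to map compression behavior onto the concentration of the token distribution.
Analyze three datasets: a controlled Human–LLM corpus, Wikipedia vs. Grokipedia, and Moltbook vs. Reddit.
Extract additional features (conditional compression, prefix-curve statistics, word-order measures, entropy, type-token ratio (TTR), repetition) and train classifiers to distinguish human from LLM text.
Apply SHAP analysis to interpret feature importance in the classification tasks.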
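The first two steps of the method can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the compression level, the decision to count gzip header bytes in C(x), the `step` size, and the function names `compression_ratio` and `prefix_curve` are all assumptions made here for the example.

```python
import gzip


def compression_ratio(text: str, level: int = 9) -> float:
    """R(x) = C(x) / |x|, with both sizes in bytes of the UTF-8 encoding."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("empty text has no defined compression ratio")
    # mtime=0 keeps the gzip header deterministic across runs.
    # Assumption: header and trailer bytes are counted as part of C(x).
    compressed = gzip.compress(raw, compresslevel=level, mtime=0)
    return len(compressed) / len(raw)


def prefix_curve(text: str, step: int = 200) -> list[tuple[int, float]]:
    """Compression ratio of growing prefixes, cut on character boundaries."""
    lengths = list(range(step, len(text) + 1, step))
    # Always include the full text as the last point of the curve.
    if not lengths or lengths[-1] != len(text):
        lengths.append(len(text))
    return [(n, compression_ratio(text[:n], level=9)) for n in lengths]


if __name__ == "__main__":
    repetitive = "the cat sat on the mat. " * 50
    for n_chars, ratio in prefix_curve(repetitive, step=300):
        print(f"{n_chars:5d} chars  R = {ratio:.3f}")
```

A lower R means the text is more compressible, i.e. more structurally regular. On very short inputs the fixed gzip overhead dominates the ratio, which is one practical reason to look at the curve over prefix lengths rather than a single value.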

Figure 1 : (A) Relationship between vocabulary entropy and compression ratio for texts generated from word distributions with fixed entropy. The colored points show the average values for Humans and LLMs. The bars indicate one standard deviation from the mean. The inset displays the density distribu

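Experimental Results

Research Questions

RQ1: Can lossless compression serve as a model-agnostic signal of probabilistic language generation?
RQ2: How does structural regularity, as measured by compression, differ between human and machine-generated language across controlled, mediated, and synthetic settings?
RQ3: Does the compression-based signature persist across model families and task contexts, and how does it scale with text length?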

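Key Findings

In the controlled setting, higher vocabulary entropy correlates with a higher compression ratio (lower compressibility), and LLM text is generally more compressible than human text.
A binary classifier using compression-based and lexical features reaches 0.93 accuracy and 0.88 F1 on the Human vs. LLM task; the GPT-family signal is especially identifiable.
For Wikipedia vs. Grokipedia, compression differences emerge at longer prefixes; Grokipedia shows slightly lower conditional compression and higher word-level entropy.
For Moltbook vs. Reddit, differences are observed only at short post lengths; Moltbook shows higher lexical diversity and is slightly less compressible.
The compression-based signature reliably separates language regimes across model families and domains, but the separation weakens at the small scales of fragmented interaction.
The results highlight a structural footprint of likelihood-driven generation rather than semantic quality.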
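The first finding above, that higher vocabulary entropy goes with a higher compression ratio, can be reproduced on toy data. The sketch below is an assumption-laden illustration rather than the paper's procedure: it controls entropy through a Zipf-like exponent over a hypothetical vocabulary (`w000`, `w001`, ...), and the names `zipf_weights`, `sample_text`, and `entropy_sweep` are invented for this example.

```python
import gzip
import math
import random


def zipf_weights(vocab_size: int, exponent: float) -> list[float]:
    """Normalized weights proportional to 1 / rank**exponent."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def shannon_entropy(weights: list[float]) -> float:
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in weights if p > 0)


def sample_text(weights: list[float], n_words: int, rng: random.Random) -> str:
    """Draw words independently from the distribution and join with spaces."""
    vocab = [f"w{i:03d}" for i in range(len(weights))]  # hypothetical vocabulary
    return " ".join(rng.choices(vocab, weights=weights, k=n_words))


def compression_ratio(text: str) -> float:
    """Compressed size over original size, in UTF-8 bytes."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw, mtime=0)) / len(raw)


def entropy_sweep(exponents, vocab_size=256, n_words=2000, seed=0):
    """Return (entropy in bits, compression ratio) for each exponent."""
    rng = random.Random(seed)
    rows = []
    for s in exponents:
        weights = zipf_weights(vocab_size, s)
        text = sample_text(weights, n_words, rng)
        rows.append((shannon_entropy(weights), compression_ratio(text)))
    return rows


if __name__ == "__main__":
    for h, r in entropy_sweep([0.0, 1.0, 2.0, 3.0]):
        print(f"entropy = {h:5.2f} bits   R = {r:.3f}")
```

An exponent of 0 gives a uniform distribution (maximum entropy), while larger exponents concentrate probability on a few words. More concentrated distributions yield text that gzip compresses further, which is the direction of the relationship the review describes.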

Figure 2 : Distribution of structural and compression-based features across human-written and LLM-generated texts in the Human–AI Parallel Corpus.

This review was generated by AI and reviewed by a human editor.