QUICK REVIEW

[Paper Review] Scale Dependent Data Duplication

Joshua Kazdan, Noam Levi|arXiv (Cornell University)|Feb 18, 2026

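Data Quality and Management | Cited by 0

One-Sentence Summary

This paper demonstrates that semantic duplicates become increasingly harmful as scale grows, derives scaling laws that account for limited semantic uniqueness, and provides a practical method for estimating the effective semantic pool size so that scaling becomes predictable again.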

ABSTRACT

Data duplication during pretraining can degrade generalization and lead to memorization, motivating aggressive deduplication pipelines. However, at web scale, it is unclear what constitutes a "duplicate": beyond surface-form matches, semantically equivalent documents (e.g. translations) may induce redundant training signals once models become sufficiently capable. Practically, this means that semantic duplicates operate increasingly like exact duplicates during training. We present evidence that duplication is scale-dependent in two ways. First, as model capability increases, cross-entropy loss gradients for semantically equivalent documents become more aligned. Smaller models, by contrast, produce gradients that reflect surface similarity (e.g., shared tokens) rather than semantic similarity. Second, we embedded all 192 million FineWeb-Edu-Dedup documents using EmbeddingGemma-300m. For moderate corpus sizes, the cosine similarity between nearest-neighbors follows an isotropic power law baseline. However, as corpus size grows to hundreds of billions of tokens, the nearest-neighbor similarities deviate sharply, indicating accelerated semantic collisions. Finally, controlled pretraining on data sampled with replacement from pools of finite unique documents shows that limited uniqueness yields mild degradation for small models, but rapidly increasing loss penalties for larger models, breaking naive scaling extrapolation. We derive explicit scaling laws that allow practitioners to estimate deviation from expected scaling due to limited semantic uniqueness of the pretraining corpus. Our results identify and resolve an unstudied source of scale-dependence, allowing for more accurate prediction at scale.

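Motivation and Objectives

Quantify how semantic duplication affects the training signal as model capability increases.
Show that, at scale, semantically equivalent documents produce near-duplicate training signals.
Demonstrate that semantic collisions, beyond surface-level similarity, accelerate in larger corpora.
Derive scaling laws that account for limited semantic uniqueness, restoring predictable scaling.
Provide a practical method for estimating the effective semantic pool size from training-data statistics.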

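Proposed Method

Across model scales, measure the similarity between each document's cross-entropy gradient and the gradients of its semantics-preserving transformations.
Embed a large real-world document collection (FineWeb-Edu-Dedup) and analyze nearest-neighbor cosine similarity at different corpus sizes to identify where scaling breaks down.
Train decoder transformers on controlled data pools to observe how limited uniqueness degrades performance as compute increases.
Develop a theory in which semantics are hierarchical latent variables, and define effective duplication through a gradient decomposition ($\mu$, $\delta_z$, $\xi_x$).
Propose a three-parameter plane law, $\Delta(C, K) = a\,C^{\beta} K^{-\gamma}$, that restores predictable scaling when uniqueness is limited.
Provide a method for deriving the effective pool size $K_{\text{eff}}$ from the mean nearest-neighbor cosine similarity (Equations 29–34).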
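
The three-parameter law $\Delta(C, K) = a\,C^{\beta} K^{-\gamma}$ becomes a plane after taking logarithms: $\log \Delta = \log a + \beta \log C - \gamma \log K$. The following is a minimal sketch of how such a law could be fitted by ordinary least squares in log space. It is not the authors' fitting procedure. The function name `fit_plane_law` and all numbers in the usage example are illustrative assumptions; the data points are synthetic, not results from the paper.

```python
import math


def fit_plane_law(samples):
    """Fit Delta = a * C**beta * K**(-gamma) by least squares in log space.

    samples: iterable of (C, K, Delta) triples, all strictly positive.
    Returns the tuple (a, beta, gamma).
    """
    samples = list(samples)
    # In log space the law is a plane: log D = log a + beta*log C - gamma*log K.
    rows = [(1.0, math.log(c), math.log(k)) for c, k, _ in samples]
    ys = [math.log(d) for _, _, d in samples]

    # Augmented 3x4 matrix for the normal equations (X^T X) w = X^T y.
    m = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(3)
    ]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]

    w = [m[i][3] / m[i][i] for i in range(3)]
    # w[2] is the coefficient of log K, which equals -gamma.
    return math.exp(w[0]), w[1], -w[2]


# Synthetic, noise-free points generated from assumed parameters
# (a=0.5, beta=0.3, gamma=0.4); these values are NOT from the paper.
synthetic = [
    (c, k, 0.5 * c**0.3 * k**-0.4)
    for c in (1e15, 1e16, 1e17, 1e18)
    for k in (1e4, 1e5, 1e6)
]
a, beta, gamma = fit_plane_law(synthetic)
print(f"a={a:.4f}, beta={beta:.4f}, gamma={gamma:.4f}")
```

Fitting in log space keeps the problem linear, so three parameters can be recovered from a small grid of (compute, pool size) runs. With noisy measurements, a weighted or robust fit may be preferable.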

Figure 1: Semantic-preserving transformations yield more aligned gradients for larger/stronger models. We sample $N{=}1000$ FineWeb-Edu-Dedup documents and compute per-document gradients of normalized next-token cross-entropy (Eq. 2) for each model. We report mean cosine similarity between (i) unr

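Experimental Results

Research Questions

RQ1: Do semantically equivalent documents induce more aligned training gradients as model capability increases?
RQ2: How does corpus size affect semantic collisions and the deviation from the isotropic scaling law?
RQ3: Can the scale-dependent degradation caused by limited semantic uniqueness be modeled and corrected?
RQ4: How can the effective semantic pool size be estimated from the observable training stream to restore predictable scaling?
RQ5: Do synthetic corpora show the same scaling-law collapse as real data, and what does this imply for data diversity?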

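Key Findings

In more capable models, semantic duplicates induce aligned gradient updates, so they behave like exact duplicates during training.
Nearest-neighbor cosine similarity deviates from the power-law baseline as the corpus grows, indicating accelerated semantic collisions at large corpus sizes.
Synthetic data deviates from the scaling law earlier than real data does, suggesting that synthetic data has lower semantic diversity.
Limited uniqueness in the training data causes scale-dependent degradation, breaking naive scaling extrapolation to larger models.
The three-parameter plane law accurately predicts evaluation loss across compute budgets and semantic pool sizes, restoring predictable scaling.
The effective semantic pool size $K_{\text{eff}}$ can be estimated from the mean nearest-neighbor cosine similarity, enabling practical scaling corrections.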
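
The statistic behind the nearest-neighbor findings above is the mean cosine similarity between each document embedding and its closest neighbor, tracked as the corpus grows. The sketch below computes that statistic by brute force on nested subsets. It does not implement the paper's $K_{\text{eff}}$ estimator (Equations 29–34), which is not reproduced in this review. The function names are illustrative, and random isotropic Gaussian vectors stand in for real embeddings as an assumption.

```python
import math
import random


def mean_nn_cosine(vectors):
    """Mean, over all points, of the cosine similarity to the nearest other point."""
    # Normalize so that a dot product equals cosine similarity.
    unit = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])

    total = 0.0
    for i, u in enumerate(unit):
        # Brute-force search over every other point: O(n^2 * d).
        total += max(
            sum(a * b for a, b in zip(u, w))
            for j, w in enumerate(unit)
            if j != i
        )
    return total / len(unit)


def nn_similarity_curve(vectors, sizes):
    """Evaluate the statistic on nested prefixes of the corpus."""
    return [(n, mean_nn_cosine(vectors[:n])) for n in sizes]


# Stand-in "embeddings": isotropic Gaussian vectors (an assumption, not real data).
rng = random.Random(0)
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(400)]
for n, sim in nn_similarity_curve(embeddings, [50, 100, 200, 400]):
    print(f"corpus size {n:4d}: mean NN cosine = {sim:.3f}")
```

On isotropic data the curve rises smoothly with corpus size, which is the kind of baseline the paper compares against. A real corpus with semantic near-duplicates would be expected to rise faster. At the scale of hundreds of millions of documents, an approximate nearest-neighbor index would replace the quadratic loop.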

Figure 2: Semantic sensitivity emerges over training and is accelerated by scale. For a fixed model family, we compute AUC to detect whether a candidate gradient corresponds to a semantic-preserving transformation of the same document versus an unrelated document, with cosine similarity to the orig

This review was generated by AI and reviewed by human editors.