QUICK REVIEW

[Paper Review] Scale Dependent Data Duplication

Joshua Kazdan, Noam Levi|arXiv (Cornell University)|Feb 18, 2026

Data Quality and Management|Cited by 0

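One-Line Summary

Summary: This paper shows that semantic duplication becomes more harmful as scale increases, derives scaling laws that account for finite semantic uniqueness, and provides a method for estimating the effective semantic pool size as a practical way to restore predictable scaling.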

ABSTRACT

Data duplication during pretraining can degrade generalization and lead to memorization, motivating aggressive deduplication pipelines. However, at web scale, it is unclear what constitutes a "duplicate": beyond surface-form matches, semantically equivalent documents (e.g. translations) may induce redundant training signals once models become sufficiently capable. Practically, this means that semantic duplicates operate increasingly like exact duplicates during training. We present evidence that duplication is scale-dependent in two ways. First, as model capability increases, cross-entropy loss gradients for semantically equivalent documents become more aligned. Smaller models, by contrast, produce gradients that reflect surface similarity (e.g., shared tokens) rather than semantic similarity. Second, we embedded all 192 million FineWeb-Edu-Dedup documents using EmbeddingGemma-300m. For moderate corpus sizes, the cosine similarity between nearest-neighbors follows an isotropic power law baseline. However, as corpus size grows to hundreds of billions of tokens, the nearest-neighbor similarities deviate sharply, indicating accelerated semantic collisions. Finally, controlled pretraining on data sampled with replacement from pools of finite unique documents shows that limited uniqueness yields mild degradation for small models, but rapidly increasing loss penalties for larger models, breaking naive scaling extrapolation. We derive explicit scaling laws that allow practitioners to estimate deviation from expected scaling due to limited semantic uniqueness of the pretraining corpus. Our results identify and resolve an unstudied source of scale-dependence, allowing for more accurate prediction at scale.
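
The controlled experiments in the abstract sample training data with replacement from a pool of finite unique documents. As a rough illustration of how quickly a small pool stops supplying fresh documents, the sketch below computes the expected number of distinct documents seen after n uniform draws from a pool of K, using the standard identity K(1 - (1 - 1/K)^n). The function names and the uniform-sampling assumption are illustrative and are not taken from the paper.

```python
def expected_unique(pool_size: int, num_draws: int) -> float:
    """Expected number of distinct documents after `num_draws` uniform draws
    with replacement from `pool_size` unique documents.

    Assumption (not from the paper): every document is equally likely
    to be drawn at each step.
    """
    if pool_size <= 0 or num_draws < 0:
        raise ValueError("pool_size must be positive and num_draws non-negative")
    # Probability that one specific document is never drawn: (1 - 1/K)^n.
    p_never_seen = (1.0 - 1.0 / pool_size) ** num_draws
    return pool_size * (1.0 - p_never_seen)


def expected_repeat_fraction(pool_size: int, num_draws: int) -> float:
    """Expected fraction of draws that repeat an already-seen document."""
    if num_draws == 0:
        return 0.0
    return 1.0 - expected_unique(pool_size, num_draws) / num_draws
```

Under these assumptions, drawing far more samples than the pool holds makes almost every draw a repeat, while a pool much larger than the sample count yields almost none. This only counts exact repeats; the paper's point is that semantic duplicates increasingly act the same way.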

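Motivation and Objectives

Quantify how semantic duplication affects the training signal as model capability grows.
Show that semantically equivalent documents come to act as near-duplicate training signals as scale increases.
Show that semantic collisions, beyond surface-level similarity, accelerate in large corpora.
Derive scaling laws that incorporate finite semantic uniqueness and restore predictable scaling.
Provide a practical method for estimating the effective semantic pool size from training-data statistics.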

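Proposed Method

Measure, across model scales, the similarity between each document's cross-entropy gradient and the gradients of its semantics-preserving transformations.
Embed a large real-world document set (FineWeb-Edu-Dedup) and analyze nearest-neighbor cosine similarity as a function of corpus size to identify where the scaling breaks.
Train decoder transformers on controlled data pools to observe how finite uniqueness degrades performance as compute increases.
Build a theory that treats semantics as hierarchical latent variables, and define effective duplication through a gradient decomposition (mu, delta_z, xi_x).
Propose a single planar law, Delta(C,K) = a C^beta K^(-gamma), that restores predictable scaling even when semantic uniqueness is limited.
Provide a method for estimating the effective K_eff from the mean nearest-neighbor cosine similarity (Eqs. 29–34).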
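
The law Delta(C,K) = a C^beta K^(-gamma) is a plane in log space: log Delta = log a + beta log C - gamma log K. The sketch below fits the three parameters by ordinary least squares on the logs. It is one plausible way to fit such a law, not the authors' fitting procedure. It assumes C is compute, K is the pool size, and Delta is a positive loss penalty; the function names are hypothetical.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: the data must vary both C and K")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        residual = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = residual / M[i][i]
    return x


def fit_planar_law(points):
    """Fit Delta = a * C**beta * K**(-gamma) to (C, K, Delta) triples.

    All values must be positive. Returns (a, beta, gamma).
    """
    points = list(points)
    # Design rows [1, log C, -log K] so the coefficients are (log a, beta, gamma).
    rows = [(1.0, math.log(C), -math.log(K)) for C, K, _ in points]
    y = [math.log(delta) for _, _, delta in points]
    # Normal equations (X^T X) w = X^T y for the 3-parameter least-squares fit.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    log_a, beta, gamma = solve_linear(A, b)
    return math.exp(log_a), beta, gamma


def predict_delta(C, K, a, beta, gamma):
    """Evaluate the planar law at compute C and pool size K."""
    return a * C ** beta * K ** (-gamma)
```

Given measured (C, K, Delta) triples from a few small runs, `fit_planar_law` returns parameters that `predict_delta` can extrapolate to larger budgets. Fitting in log space weights relative rather than absolute errors, which is a design choice of this sketch.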

Figure 1: Semantic-preserving transformations yield more aligned gradients for larger/stronger models. We sample $N=1000$ FineWeb-Edu-Dedup documents and compute per-document gradients of normalized next-token cross-entropy (Eq. 2) for each model. We report mean cosine similarity between (i) unr

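Experimental Results

Research Questions

RQ1: Do semantically equivalent documents produce more aligned training gradients as model capability increases?
RQ2: How does corpus size affect semantic collisions and the deviation from the isotropic scaling law?
RQ3: Can the scale-dependent degradation caused by finite semantic uniqueness be modeled and corrected?
RQ4: Can the effective semantic pool size be estimated from an observable training stream to restore scaling predictability?
RQ5: Do synthetic data corpora show the same breakdown of scaling laws as real data, and what does this imply for data diversity?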
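
RQ1 turns on comparing per-document loss gradients by cosine similarity. The toy below is not the paper's setup, which uses real language models. It is a minimal sketch, assuming a bigram model with all-zero logits, of why a very weak model can only show gradient alignment through shared tokens. The helper names `bigram_ce_gradient` and `cosine` and the token ids are hypothetical.

```python
import math


def bigram_ce_gradient(tokens, vocab_size):
    """Gradient of mean next-token cross-entropy w.r.t. a bigram logit table.

    Toy assumption: all logits are zero, so every softmax is uniform and the
    per-position gradient on row `prev` is (uniform - one_hot(next)).
    Returns the gradient flattened to a list of length vocab_size**2.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form a next-token pair")
    grad = [[0.0] * vocab_size for _ in range(vocab_size)]
    num_pairs = len(tokens) - 1
    uniform = 1.0 / vocab_size
    for prev, nxt in zip(tokens, tokens[1:]):
        for v in range(vocab_size):
            grad[prev][v] += uniform / num_pairs
        grad[prev][nxt] -= 1.0 / num_pairs
    return [g for row in grad for g in row]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

In this toy, two documents with disjoint vocabularies always have zero gradient similarity, however close their meaning, because they touch different rows of the logit table. That is the surface-similarity regime the abstract attributes to smaller models; measuring semantic alignment requires gradients from a capable model.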

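Key Findings

Semantic duplicates align gradient updates in more capable models, so they behave like exact duplicates during training.
The scaling of nearest-neighbor cosine similarity deviates from the power-law baseline in large corpora, indicating accelerating semantic collisions.
Synthetic data deviates from the scaling law earlier than real data, suggesting lower semantic diversity in synthetic datasets.
Finite uniqueness in the training data causes scale-dependent degradation and breaks naive scaling extrapolation for larger models.
The three-parameter planar law accurately predicts evaluation loss across combinations of compute and pool size, restoring predictability at scale.
The effective semantic pool size K_eff can be estimated from the mean nearest-neighbor cosine similarity, enabling practical scaling corrections.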
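
The statistic behind both the collision analysis and the K_eff estimate is the mean nearest-neighbor cosine similarity of document embeddings. The sketch below computes it by brute force for a small sample and tracks it over growing corpus prefixes. The mapping from this statistic to K_eff (Eqs. 29–34 of the paper) is not reproduced here, and the function names are hypothetical. The O(n^2) search is an assumption suitable only for small samples, not a description of how the authors processed 192 million documents.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def mean_nearest_neighbor_cosine(embeddings):
    """Mean over documents of the cosine similarity to the closest other document."""
    unit = [normalize(v) for v in embeddings]
    if len(unit) < 2:
        raise ValueError("need at least two embeddings")
    total = 0.0
    for i, u in enumerate(unit):
        # Brute-force search over every other document, excluding the document itself.
        total += max(
            sum(a * b for a, b in zip(u, w))
            for j, w in enumerate(unit)
            if j != i
        )
    return total / len(unit)


def nn_similarity_by_prefix(embeddings, sizes):
    """Evaluate the statistic on the first n embeddings for each n in `sizes`."""
    return [(n, mean_nearest_neighbor_cosine(embeddings[:n])) for n in sizes]
```

Plotting the output of `nn_similarity_by_prefix` against corpus size on log axes is one way to look for the departure from a power-law baseline that the findings describe.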

Figure 2: Semantic sensitivity emerges over training and is accelerated by scale. For a fixed model family, we compute AUC to detect whether a candidate gradient corresponds to a semantic-preserving transformation of the same document versus an unrelated document, with cosine similarity to the orig


This review was generated by AI and checked by a human editor.