QUICK REVIEW

[Paper Review] Deriving Neural Scaling Laws from the statistics of natural language

Francesco Cagnetta, Allan Raventós|arXiv (Cornell University)|Feb 7, 2026

Topic Modeling | Citations: 0

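One-line summary

The paper derives a parameter-free theory that ties the data-limited neural scaling exponent to two statistics of the data (the decay of the conditional entropy and the decay of token correlations), and validates it with GPT-2 and LLaMA models on TinyStories and WikiText.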

ABSTRACT

Despite the fact that experimental neural scaling laws have substantially guided empirical progress in large-scale machine learning, no existing theory can quantitatively predict the exponents of these important laws for any modern LLM trained on any natural language dataset. We provide the first such theory in the case of data-limited scaling laws. We isolate two key statistical properties of language that alone can predict neural scaling exponents: (i) the decay of pairwise token correlations with time separation between token pairs, and (ii) the decay of the next-token conditional entropy with the length of the conditioning context. We further derive a simple formula in terms of these statistics that predicts data-limited neural scaling exponents from first principles without any free parameters or synthetic data models. Our theory exhibits a remarkable match with experimentally measured neural scaling laws obtained from training GPT-2 and LLaMA style models from scratch on two qualitatively different benchmarks, TinyStories and WikiText.

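Motivation and objectives

Identify the language statistics that determine the data-limited neural scaling exponent.
Derive a parameter-free formula, based on those language statistics, that relates dataset size to loss scaling.
Experimentally validate the theoretical predictions across multiple model classes and datasets.
Demonstrate a data collapse of the n-gram losses under rescaling and quantify the data-limited exponent.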

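Proposed method

Define a decomposition of the loss into a horizon-based error and a within-horizon error.
Introduce two language statistics: the decay of the next-token entropy H_n with context length n, and the decay of the token-token correlations C(n) with lag n.
Assume power-law decays: H_n - H_infty ~ n^{-gamma} and ||C(n)||_op ~ n^{-beta}.
Derive the horizon n*(P) ~ P^{1/(2 beta)} and the autoregressive loss L_AR(P) - H_infty ~ P^{-gamma/(2 beta)} as functions of the dataset size P.
Predict the data-limited exponent alpha_D = gamma/(2 beta) and check it via the scaling collapse L_n(P) ~ n^{-gamma} ell(P/n^{2 beta}).
Measure gamma and beta on TinyStories and WikiText, then compare the predicted alpha_D with the observed scaling.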
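The last two steps (measure gamma and beta, then predict alpha_D) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the curves below are synthetic power laws with made-up exponents, the function names are hypothetical, and the sketch assumes that H_infty has already been subtracted from the entropy curve. Only the relations alpha_D = gamma/(2 beta) and n*(P) ~ P^{1/(2 beta)} come from the review.

```python
import numpy as np

def fit_power_law_exponent(n, y):
    """Return the decay exponent e of y ~ n^(-e) from a log-log least-squares fit."""
    slope, _intercept = np.polyfit(np.log(n), np.log(y), 1)
    return -slope

def predicted_alpha_d(gamma, beta):
    """Data-limited exponent alpha_D = gamma / (2 beta), as stated in the review."""
    return gamma / (2.0 * beta)

def horizon(P, beta):
    """Horizon n*(P) ~ P^(1 / (2 beta)), up to an unknown prefactor (set to 1 here)."""
    return P ** (1.0 / (2.0 * beta))

# Synthetic stand-ins for the measured curves (illustrative exponents, not paper data).
n = np.arange(1, 65, dtype=float)
true_gamma, true_beta = 0.3, 0.9
excess_entropy = 2.0 * n ** (-true_gamma)   # plays the role of H_n - H_infty
corr_norm = 0.5 * n ** (-true_beta)         # plays the role of ||C(n)||_op

gamma_hat = fit_power_law_exponent(n, excess_entropy)
beta_hat = fit_power_law_exponent(n, corr_norm)
alpha_hat = predicted_alpha_d(gamma_hat, beta_hat)

print(f"gamma ~ {gamma_hat:.3f}, beta ~ {beta_hat:.3f}, alpha_D ~ {alpha_hat:.3f}")
print(f"n*(P = 1e6) ~ {horizon(1e6, beta_hat):.1f} (prefactor assumed to be 1)")
```

On real data the fit range matters: a log-log fit over lags where the power law has not yet set in, or where estimates are dominated by noise, would bias both exponents.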


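Experimental results

Research questions

RQ1: Which two language statistics determine the data-limited neural scaling exponent of language models?
RQ2: Can a parameter-free theory predict the data-limited loss scaling exponent alpha_D from measured language statistics?
RQ3: Do the n-gram losses collapse onto a master curve when rescaled with gamma and beta?
RQ4: Is the predicted alpha_D consistent across model architectures and datasets?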
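To make the rescaling behind RQ3 concrete, the sketch below applies it to synthetic n-gram losses. Everything here is an assumption for illustration: the exponents are made up, `master_curve` is a hypothetical choice for the scaling function ell, and the losses are generated to obey L_n(P) = n^{-gamma} ell(P/n^{2 beta}) exactly, so the collapse holds by construction. With measured losses, `ngram_excess_loss` would be replaced by the data and the quality of the collapse would be the actual test.

```python
import numpy as np

gamma, beta = 0.3, 0.9  # illustrative exponents, not the paper's measurements

def master_curve(x):
    """Hypothetical scaling function ell(x); any smooth function would do here."""
    return 1.0 + 1.0 / x

def ngram_excess_loss(n, P):
    """Synthetic n-gram excess loss obeying L_n(P) = n^(-gamma) * ell(P / n^(2 beta))."""
    return n ** (-gamma) * master_curve(P / n ** (2 * beta))

P = np.logspace(2, 6, 9)  # dataset sizes
rescaled = {}
for n in (2, 4, 8):
    x = P / n ** (2 * beta)                  # rescaled dataset size
    y = ngram_excess_loss(n, P) * n ** gamma # rescaled loss
    rescaled[n] = (x, y)

# If the collapse holds, every (x, y) pair lies on the same curve ell(x).
collapse_error = max(
    np.max(np.abs(y - master_curve(x))) for x, y in rescaled.values()
)
print(f"max deviation from the master curve: {collapse_error:.2e}")
```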

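Key findings

Gamma values: TinyStories ≈ 0.34, WikiText ≈ 0.27.
Beta values: TinyStories ≈ 0.88, WikiText ≈ 0.94.
The predicted data-limited exponent alpha_D is ≈ 0.19 for TinyStories and ≈ 0.14 for WikiText, in agreement with the observed losses.
After rescaling, the n-gram losses collapse onto a single curve: L_n(P) ≈ n^{-gamma} ell(P/n^{2 beta}).
Beyond the structure length T, the autoregressive loss scales as L_AR(P) - H_infty ~ P^{-gamma/(2 beta)}.
The theory holds on both datasets for GPT-2-style (APE/RoPE) and LLaMA architectures.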
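As a quick consistency check on the numbers above, the reported gamma and beta values can be plugged into alpha_D = gamma/(2 beta). The snippet uses only the rounded values quoted in this review, so small deviations from the paper's own figures are possible; the dictionary layout is just an illustrative choice.

```python
# Rounded values as quoted in this review.
reported = {
    "TinyStories": {"gamma": 0.34, "beta": 0.88, "alpha_D": 0.19},
    "WikiText": {"gamma": 0.27, "beta": 0.94, "alpha_D": 0.14},
}

# Recompute alpha_D = gamma / (2 beta) for each dataset.
recomputed = {
    name: values["gamma"] / (2 * values["beta"])
    for name, values in reported.items()
}

for name, values in reported.items():
    print(
        f"{name}: gamma/(2 beta) = {recomputed[name]:.3f} "
        f"(reported alpha_D ~ {values['alpha_D']})"
    )
```

Both recomputed values (about 0.193 and 0.144) round to the reported exponents, so the quoted figures are internally consistent.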


This review was generated by AI and checked by a human editor.