QUICK REVIEW

[Paper Review] Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions

Minda Zhao, Yilun Du|arXiv (Cornell University)|Jan 8, 2026

Topic Modeling | Citations: 0

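One-Line Summary

This paper audits 11 frontier LLMs across 15 distributions and shows that native sampling is weak and highly protocol-dependent. Independent sampling fails almost entirely, and these failures propagate into downstream tasks.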

ABSTRACT

As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines across domains like educational assessment and synthetic data construction, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces $N=1000$ samples within one response, and Independent Requests, comprising $N=1000$ stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 13% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we reveal that sampling fidelity degrades monotonically with distributional complexity and aggravates as the requested sampling horizon $N$ increases. Finally, we demonstrate the propagation of these failures into downstream tasks: models fail to enforce uniform answer-position constraints in MCQ generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating the use of external tools for applications requiring statistical guarantees.

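Motivation and Objectives

Evaluate whether current LLMs can sample accurately from a user-specified 1D distribution without external tools.
Quantify sampling fidelity over a set of distributions spanning multiple complexity levels.
Investigate how the sampling protocol (batch generation vs. independent requests) affects distributional accuracy.
Assess the downstream impact on MCQ generation and attribute-constrained prompt design.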

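Proposed Method

Define a sampling-fidelity metric based on the Wasserstein-1 distance between the generated and target distributions.
Use two protocols: Batch Generation (N=1000 samples in a single response) and Independent Requests (N=1000 stateless calls).
Benchmark 11 models on 15 distributions spanning three complexity tiers.
Apply the KS test to continuous distributions and the chi-square test to discrete distributions, at α=0.01.
Supplement these with KL divergence and a detailed convergence analysis across sample sizes N.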
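The fidelity checks listed above can be illustrated for a continuous target. The following is a minimal sketch, not the authors' evaluation code: it assumes a standard normal target, uses the asymptotic KS critical value rather than an exact one, and approximates the Wasserstein-1 distance by matching sorted samples to target quantiles. All function names are hypothetical. SciPy's `scipy.stats.kstest` and `scipy.stats.wasserstein_distance` are library alternatives.

```python
import math
from statistics import NormalDist

ALPHA = 0.01  # significance level reported in the review


def ks_statistic(samples, cdf):
    """One-sample Kolmogorov-Smirnov statistic against a target CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Largest gap between the empirical CDF steps and the target CDF.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_pass(samples, cdf, alpha=ALPHA):
    """Pass if the statistic is below the asymptotic critical value (assumption)."""
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) / math.sqrt(len(samples))
    return ks_statistic(samples, cdf) < critical


def wasserstein_1(samples, inv_cdf):
    """Approximate W1: mean gap between sorted samples and midpoint target quantiles."""
    xs = sorted(samples)
    n = len(xs)
    return sum(abs(x - inv_cdf((i - 0.5) / n)) for i, x in enumerate(xs, start=1)) / n


# Illustrative inputs (not data from the paper).
target = NormalDist(mu=0.0, sigma=1.0)
n = 1000
ideal = [target.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]  # perfectly spread
collapsed = [0.0] * n  # every "sample" is the same value

print(ks_pass(ideal, target.cdf), wasserstein_1(ideal, target.inv_cdf))
print(ks_pass(collapsed, target.cdf), wasserstein_1(collapsed, target.inv_cdf))
```

The two inputs are deliberately extreme: a perfectly stratified sample passes with a distance near zero, while a sample collapsed onto one value fails the test and sits far from the target.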

Figure 1: Overview of the Evaluation Pipeline. We systematically benchmark 11 frontier LLMs across 15 probability distributions spanning three complexity tiers. The evaluation employs a dual-protocol design to disentangle failure modes: Protocol A (Batch) produces samples sequentially within a singl
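The two protocols named in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions: `ask_model` is a hypothetical callable that takes a prompt string and returns the model's text reply, and the prompt wording and the parsing rule are invented for this example rather than taken from the paper.

```python
from typing import Callable, List


def parse_numbers(text: str) -> List[float]:
    """Keep tokens that parse as floats; skip anything else in the reply."""
    values = []
    for token in text.replace(",", " ").split():
        try:
            values.append(float(token))
        except ValueError:
            continue
    return values


def batch_generation(ask_model: Callable[[str], str], description: str, n: int = 1000) -> List[float]:
    """Protocol A: one request that asks for all n samples in a single response."""
    prompt = f"Generate {n} samples from {description}. Output only the numbers, separated by commas."
    return parse_numbers(ask_model(prompt))[:n]


def independent_requests(ask_model: Callable[[str], str], description: str, n: int = 1000) -> List[float]:
    """Protocol B: n stateless requests, each asking for a single sample."""
    prompt = f"Generate one sample from {description}. Output only the number."
    samples = []
    for _ in range(n):
        values = parse_numbers(ask_model(prompt))  # no history is carried between calls
        if values:
            samples.append(values[0])
    return samples
```

Passing the model in as a callable keeps the sketch independent of any particular API client, so a stub can stand in for a real model during testing.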

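Experimental Results

Research Questions

RQ1: Can frontier LLMs sample accurately from a specified distribution internally, without external libraries?
RQ2: How does sampling fidelity scale with distribution complexity and with the sampling budget N?
RQ3: Does batch generation reveal true sampling ability when compared with independent requests?
RQ4: Do native sampling failures propagate to downstream generation tasks such as MCQ construction and attribute-controlled prompts?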

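Key Findings

Independent sampling fails almost completely: 10 of 11 models pass none of the distributions.
Batch generation achieves only modest validity, with a median pass rate of 13%; even the top model reaches only 40% on the distribution set.
Sampling fidelity declines as distribution complexity increases, with Tier III distributions showing the largest deficits.
The Wasserstein-1 distance rises as the sampling horizon N increases, indicating inverse scaling, with fidelity degrading rather than converging at larger N.
Downstream tasks show clear biases: correct-answer positions in MCQs are not uniform, and attribute targets in prompts are not met.
LLMs lack a functional internal sampler, so external tools are needed to guarantee statistical sampling accuracy.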
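The last two findings suggest a simple workaround for MCQ generation: draw the correct-answer position with an external sampler and audit the result with a chi-square test. The sketch below assumes a four-option format (A to D), which the review does not specify, and hard-codes the chi-square critical value for 3 degrees of freedom at α=0.01 (11.345). The function names are hypothetical.

```python
import random
from collections import Counter

POSITIONS = ["A", "B", "C", "D"]  # assumed four-option MCQ format
CHI2_CRITICAL = 11.345            # chi-square critical value, df=3, alpha=0.01


def assign_answer_positions(n_questions, seed=None):
    """Draw each correct-answer position from an external uniform sampler."""
    rng = random.Random(seed)
    return [rng.choice(POSITIONS) for _ in range(n_questions)]


def chi_square_uniform(labels):
    """Chi-square statistic of observed position counts against a uniform target."""
    counts = Counter(labels)
    expected = len(labels) / len(POSITIONS)
    return sum((counts.get(p, 0) - expected) ** 2 / expected for p in POSITIONS)


def is_uniform(labels):
    """True if uniformity is not rejected at alpha=0.01."""
    return chi_square_uniform(labels) < CHI2_CRITICAL


# Illustrative label sets (not data from the paper).
balanced = POSITIONS * 250
skewed = ["A"] * 700 + ["B"] * 100 + ["C"] * 100 + ["D"] * 100
print(is_uniform(balanced), chi_square_uniform(skewed))
```

In this arrangement the model would only write the question and options, while the position of the correct answer comes from `assign_answer_positions`, so the uniformity constraint no longer depends on the model's own sampling.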

Figure 2: Distribution Complexity vs. Sampling Fidelity. (a) Statistical test pass rate decreases as distribution complexity increases from Tier I (Fundamental Priors) to Tier III (Heavy-Tailed & Complex). (b) Mean Wasserstein distance $\mathcal{W}_{1}$ increases with complexity, indicating poorer d

This review was generated by AI and checked by a human editor.