QUICK REVIEW

[Paper Review] Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions

Minda Zhao, Yilun Du|arXiv (Cornell University)|Jan 8, 2026

Topic Modeling | Cited by 0

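One-Sentence Summary

The paper systematically audits 11 frontier LLMs across 15 distributions and finds that native sampling is weak and highly protocol-dependent: independent requests fail almost entirely, and the failures propagate into downstream tasks.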

ABSTRACT

As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines across domains like educational assessment and synthetic data construction, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces $N=1000$ samples within one response, and Independent Requests, comprising $N=1000$ stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 13% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we reveal that sampling fidelity degrades monotonically with distributional complexity and aggravates as the requested sampling horizon $N$ increases. Finally, we demonstrate the propagation of these failures into downstream tasks: models fail to enforce uniform answer-position constraints in MCQ generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating the use of external tools for applications requiring statistical guarantees.

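Research Motivation and Objectives

Assess whether current LLMs can faithfully sample from a user-specified univariate distribution without external tools.
Quantify sampling fidelity across a diverse set of distributions and complexity tiers.
Study how the sampling protocol (batch generation versus independent requests) affects distributional accuracy.
Evaluate the downstream consequences in MCQ generation and attribute-constrained prompt synthesis.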

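Proposed Method

Define the sampling-fidelity metric as the first-order Wasserstein distance ($\mathcal{W}_{1}$) between the generated distribution and the target distribution.
Use two protocols: Batch Generation ($N=1000$ samples produced within one response) and Independent Requests ($N=1000$ stateless calls).
Benchmark 11 models on 15 distributions spanning three complexity tiers.
Apply the Kolmogorov-Smirnov (KS) test to continuous distributions and the chi-square test to discrete distributions, at significance level $\alpha=0.01$.
Supplement these with KL divergence and a detailed convergence analysis across sample sizes $N$.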
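The fidelity checks listed above can be sketched for one continuous target as follows. This is a minimal, standard-library-only illustration and not the paper's evaluation code: the standard normal target, the function names, the midpoint-quantile approximation of $\mathcal{W}_{1}$, and the asymptotic KS critical value are all assumptions made for the example. In practice one would more likely reach for `scipy.stats.wasserstein_distance` and `scipy.stats.kstest`.

```python
import math
import random
from statistics import NormalDist

ALPHA = 0.01  # significance level stated in the summary


def wasserstein1_to_target(samples, inv_cdf):
    """Approximate W1 between the empirical sample and the target by
    pairing sorted samples with target quantiles at midpoints."""
    xs = sorted(samples)
    n = len(xs)
    return sum(abs(x - inv_cdf((i + 0.5) / n)) for i, x in enumerate(xs)) / n


def ks_statistic(samples, cdf):
    """One-sample Kolmogorov-Smirnov statistic D_n."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Largest gap between the empirical CDF steps and the target CDF.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_passes(samples, cdf, alpha=ALPHA):
    """Pass when D_n is below the asymptotic critical value
    sqrt(-ln(alpha / 2) / 2) / sqrt(n)."""
    n = len(samples)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) / math.sqrt(n)
    return ks_statistic(samples, cdf) < critical


if __name__ == "__main__":
    target = NormalDist(0.0, 1.0)  # assumed target for illustration
    rng = random.Random(0)
    reference = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    print("W1:", wasserstein1_to_target(reference, target.inv_cdf))
    print("KS pass:", ks_passes(reference, target.cdf))
```

Samples parsed from a model response would be passed in place of `reference`; a sample set that collapses onto a few values yields a large KS statistic and fails the test.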

Figure 1: Overview of the Evaluation Pipeline. We systematically benchmark 11 frontier LLMs across 15 probability distributions spanning three complexity tiers. The evaluation employs a dual-protocol design to disentangle failure modes: Protocol A (Batch) produces samples sequentially within a singl

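Experimental Results

Research Questions

RQ1: Can frontier LLMs sample accurately from a specified probability distribution internally, without external libraries?
RQ2: How does sampling fidelity change as distributional complexity and the sampling budget $N$ increase?
RQ3: Do batch generation and independent requests differ in what they reveal about true sampling ability?
RQ4: Do native sampling failures propagate to downstream generation tasks such as MCQ construction and attribute-constrained prompt generation?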
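The two protocols compared in RQ3 can be sketched as below. Everything here is hypothetical scaffolding: `ModelFn` stands in for whatever client sends a prompt and returns text, and the prompt wording and number-parsing regex are assumptions for illustration, not the prompts used in the paper.

```python
import re
from typing import Callable, List

# Hypothetical interface: takes a prompt string, returns the model's reply text.
ModelFn = Callable[[str], str]

# Matches plain and scientific-notation numbers in a reply.
NUMBER = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")


def parse_numbers(text: str) -> List[float]:
    """Extract every numeric token from a reply."""
    return [float(tok) for tok in NUMBER.findall(text)]


def batch_generation(model: ModelFn, spec: str, n: int) -> List[float]:
    """Batch protocol: one call asks for all n samples in a single response."""
    reply = model(f"Generate {n} samples from {spec}. Output numbers only.")
    return parse_numbers(reply)[:n]


def independent_requests(model: ModelFn, spec: str, n: int) -> List[float]:
    """Independent protocol: n stateless calls, one sample per call."""
    samples = []
    for _ in range(n):
        reply = model(f"Generate 1 sample from {spec}. Output the number only.")
        values = parse_numbers(reply)
        if values:  # skip replies that contain no number
            samples.append(values[0])
    return samples
```

The structural difference is that the batch protocol lets the model see its own earlier samples in context, while each independent request starts with no memory of previous draws.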

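Key Findings

Independent requests collapse almost entirely: 10 of the 11 models pass none of the distributions.
Batch generation is only modestly valid, with a median pass rate of 13%; even the top model passes only 40% of the distribution set.
Sampling fidelity degrades as distributional complexity increases, with Tier III distributions showing the largest shortfall.
The Wasserstein-1 distance rises as the sampling horizon $N$ grows, indicating inverse scaling and silent degradation at larger $N$.
Downstream tasks show clear bias: correct-answer positions in MCQs are non-uniform, and demographic targets in prompts are violated.
LLMs lack a functional internal sampler and need external tools to guarantee statistical sampling accuracy.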
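As an illustration of the external-tool remedy for the MCQ case, the sketch below draws correct-answer positions with Python's `random` module and checks uniformity with a Pearson chi-square statistic. The four-option layout, the function names, and the hard-coded critical value (chi-square, 3 degrees of freedom, alpha = 0.01) are assumptions for this example and are not taken from the paper.

```python
import random
from collections import Counter

POSITIONS = ["A", "B", "C", "D"]  # assumption: four answer options per question
CHI2_CRIT_DF3_ALPHA_001 = 11.345  # chi-square critical value, df=3, alpha=0.01


def chi_square_uniform(labels, categories):
    """Pearson chi-square statistic against a uniform target."""
    counts = Counter(labels)
    expected = len(labels) / len(categories)
    return sum((counts[c] - expected) ** 2 / expected for c in categories)


def assign_positions(n_questions, seed=None):
    """Draw each correct-answer slot from an external PRNG, not the LLM."""
    rng = random.Random(seed)
    return [rng.choice(POSITIONS) for _ in range(n_questions)]


def balanced_positions(n_questions, seed=None):
    """Alternative: near-equal counts per slot, in shuffled order."""
    rng = random.Random(seed)
    slots = [POSITIONS[i % len(POSITIONS)] for i in range(n_questions)]
    rng.shuffle(slots)
    return slots


if __name__ == "__main__":
    drawn = assign_positions(1000, seed=0)
    stat = chi_square_uniform(drawn, POSITIONS)
    print("chi-square:", stat, "pass:", stat < CHI2_CRIT_DF3_ALPHA_001)
```

In this setup the LLM would be asked only to write the question and its options, and the pipeline would place the correct option in the externally drawn slot. `balanced_positions` trades independence between questions for exact balance, which may or may not suit a given assessment design.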

Figure 2: Distribution Complexity vs. Sampling Fidelity. (a) Statistical test pass rate decreases as distribution complexity increases from Tier I (Fundamental Priors) to Tier III (Heavy-Tailed & Complex). (b) Mean Wasserstein distance $\mathcal{W}_{1}$ increases with complexity, indicating poorer d


This review was generated by AI and reviewed by human editors.