QUICK REVIEW

[Paper Digest] Quantifying non deterministic drift in large language models

Claire Nicholson|arXiv (Cornell University)|Jan 12, 2026

Data Stream Mining Techniques | Cited by: 0

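One-sentence summary

This paper measures baseline nondeterministic drift in two large language models (gpt-4o-mini and llama3.1-8b) across prompt category, deployment type, prompting mode, and temperature, shows that drift persists even at temperature 0.0, and highlights the limitations of lexical metrics.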

ABSTRACT

Large language models (LLMs) are widely used for tasks ranging from summarisation to decision support. In practice, identical prompts do not always produce identical outputs, even when temperature and other decoding parameters are fixed. In this work, we conduct repeated-run experiments to empirically quantify baseline behavioural drift, defined as output variability observed when the same prompt is issued multiple times under operator-free conditions. We evaluate two publicly accessible models, gpt-4o-mini and llama3.1-8b, across five prompt categories using exact repeats, perturbed inputs, and reuse modes at temperatures of 0.0 and 0.7. Drift is measured using unique output fractions, lexical similarity, and word count statistics, enabling direct comparison across models, prompting modes, and deployment types. The results show that nondeterminism persists even at temperature 0.0, with distinct variability patterns by model size, deployment, and prompt type. We situate these findings within existing work on concept drift, behavioural drift, and infrastructure-induced nondeterminism, discuss the limitations of lexical metrics, and highlight emerging semantic approaches. By establishing a systematic empirical baseline in the absence of stabilisation techniques, this study provides a reference point for evaluating future drift mitigation and control methods.

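Motivation and objectives

Establish a baseline measurement of LLM nondeterministic drift under operator-free conditions, that is, without operator intervention.
Compare the effects of model size, deployment type, prompting mode, and temperature on baseline drift.
Situate drift measurement within the existing literature on concept drift and infrastructure-induced nondeterminism.
Provide data and methods to support future stabilisation research.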

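Proposed method

Evaluate two publicly accessible models: gpt-4o-mini via API and llama3.1-8b run locally.
At two temperatures (0.0 and 0.7), test exact repeats, perturbed inputs, and reuse modes across five prompt categories, repeating each combination.
Each combination uses 30 gapfill runs and 20 small battery prompt runs.
Measure drift using the unique output fraction, mean pairwise Jaccard similarity, and word count statistics.
Discuss the limitations of lexical drift metrics and propose semantic metrics as a direction for future improvement.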
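
The three drift metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the tokenisation (lowercased word characters) is an assumption, and the unique output fraction is taken here as distinct outputs divided by total outputs, which may differ from the normalisation the authors use.

```python
import re
from itertools import combinations
from statistics import mean, pstdev


def tokens(text):
    """Lowercased word tokens (assumed tokenisation, not taken from the paper)."""
    return set(re.findall(r"\w+", text.lower()))


def unique_fraction(outputs):
    """Distinct outputs divided by total outputs (one plausible definition)."""
    return len(set(outputs)) / len(outputs)


def jaccard(a, b):
    """Jaccard similarity between the token sets of two texts."""
    set_a, set_b = tokens(a), tokens(b)
    if not set_a and not set_b:
        return 1.0  # two empty outputs are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def mean_pairwise_jaccard(outputs):
    """Average Jaccard similarity over all unordered pairs of outputs."""
    scores = [jaccard(a, b) for a, b in combinations(outputs, 2)]
    return mean(scores) if scores else 1.0


def word_count_stats(outputs):
    """Mean, population standard deviation, min and max of word counts."""
    counts = [len(o.split()) for o in outputs]
    return {
        "mean": mean(counts),
        "std": pstdev(counts),
        "min": min(counts),
        "max": max(counts),
    }


# Three hypothetical outputs from repeating one prompt
runs = [
    "Paris is the capital of France.",
    "Paris is the capital of France.",
    "The capital city of France is Paris.",
]
print(unique_fraction(runs))
print(mean_pairwise_jaccard(runs))
print(word_count_stats(runs))
```

Under this definition a set of fully identical outputs scores 1/n rather than 0, so values should only be compared with results computed the same way.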

Figure 1: Mean unique output fraction for exact repeats at temperature 0.0

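Experimental results

Research questions

RQ1: What is the magnitude of baseline behavioural drift when prompts are issued repeatedly without intervention?
RQ2: How does deployment type (API service vs. local open weights) affect baseline drift?
RQ3: How do prompting mode (exact repeats, perturbed inputs, reuse) and temperature setting affect drift across prompt categories?
RQ4: What are the limitations of lexical metrics for measuring drift, and how can semantic metrics improve evaluation?
RQ5: How can drift be interpreted through variance budgets and attractor regions to inform mitigation thresholds?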

Key findings

Model	Temperature	Mode	Mean unique fraction	Mean Jaccard
gpt-4o-mini	0.0	exact	0.240	0.893
gpt-4o-mini	0.0	perturb	0.572	0.632
gpt-4o-mini	0.0	reuse	0.200	0.971
gpt-4o-mini	0.7	exact	0.987	0.518
gpt-4o-mini	0.7	perturb	0.000	0.440
gpt-4o-mini	0.7	reuse	0.000	0.706
llama3.1-8b	0.0	exact	0.093	0.966
llama3.1-8b	0.0	perturb	0.274	0.789
llama3.1-8b	0.0	reuse	0.100	0.910
llama3.1-8b	0.7	exact	0.987	0.471
llama3.1-8b	0.7	perturb	0.000	0.403
llama3.1-8b	0.7	reuse	0.973	0.632

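Baseline drift is present even at temperature 0.0: for exact repeats, the mean unique output fraction is about 0.24 for gpt-4o-mini and about 0.09 for llama3.1-8b.
At temperature 0.0, perturbation increases drift (unique output fraction ~0.57 for gpt-4o-mini; ~0.27 for llama3.1-8b), while reuse mode yields lower drift than perturbation (0.20 and 0.10), close to the exact-repeat level.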
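Raising the temperature to 0.7 makes nearly every exact-repeat run produce a new output (unique fraction 0.987 for both models), and lexical similarity drops: mean Jaccard is at most 0.518 for exact repeats and perturbed inputs, while reuse remains higher at 0.706 and 0.632.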
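For exact repeats at temperature 0.0, the mean unique fraction is 0.240 (gpt-4o-mini) and 0.093 (llama3.1-8b), with mean Jaccard of 0.893 and 0.966 respectively.
Drift magnitude depends on model size, deployment type, and prompting mode, and lexical metrics have known limitations in capturing semantic drift.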
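
The limitation of lexical metrics can be seen in a small example. The sketch below is an illustration only: the sentence pairs are invented, and the word-level tokenisation is an assumption rather than the paper's procedure. It shows that token-set Jaccard can rate a contradiction as more similar than a paraphrase.

```python
import re


def jaccard(a, b):
    """Jaccard similarity over lowercased word tokens (assumed tokenisation)."""
    set_a = set(re.findall(r"\w+", a.lower()))
    set_b = set(re.findall(r"\w+", b.lower()))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Same meaning, almost no shared words (invented example)
paraphrase = (
    "The meeting was postponed until Friday.",
    "They delayed the session to the end of the week.",
)

# Opposite meaning, almost identical words (invented example)
contradiction = (
    "The drug is safe for children.",
    "The drug is not safe for children.",
)

print(f"paraphrase:    {jaccard(*paraphrase):.3f}")
print(f"contradiction: {jaccard(*contradiction):.3f}")
```

A high Jaccard score therefore does not guarantee that two outputs agree in meaning, and a low score does not guarantee that they disagree, which is the gap the semantic approaches mentioned in the paper aim to close.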

Figure 2: Mean average Jaccard similarity for exact repeats at temperature 0.0


This digest was generated by AI and reviewed by a human editor.