QUICK REVIEW

[Paper Review] Quantifying non deterministic drift in large language models

Claire Nicholson|arXiv (Cornell University)|Jan 12, 2026

Data Stream Mining Techniques | Citations: 0

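One-sentence summary

This paper measures baseline nondeterministic drift in two LLMs (gpt-4o-mini and llama3.1-8b) across prompt categories, deployment types, prompting modes, and temperatures. It shows that drift persists even at temperature 0.0 and highlights the limitations of lexical metrics.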

ABSTRACT

Large language models (LLMs) are widely used for tasks ranging from summarisation to decision support. In practice, identical prompts do not always produce identical outputs, even when temperature and other decoding parameters are fixed. In this work, we conduct repeated-run experiments to empirically quantify baseline behavioural drift, defined as output variability observed when the same prompt is issued multiple times under operator-free conditions. We evaluate two publicly accessible models, gpt-4o-mini and llama3.1-8b, across five prompt categories using exact repeats, perturbed inputs, and reuse modes at temperatures of 0.0 and 0.7. Drift is measured using unique output fractions, lexical similarity, and word count statistics, enabling direct comparison across models, prompting modes, and deployment types. The results show that nondeterminism persists even at temperature 0.0, with distinct variability patterns by model size, deployment, and prompt type. We situate these findings within existing work on concept drift, behavioural drift, and infrastructure-induced nondeterminism, discuss the limitations of lexical metrics, and highlight emerging semantic approaches. By establishing a systematic empirical baseline in the absence of stabilisation techniques, this study provides a reference point for evaluating future drift mitigation and control methods.

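Motivation and objectives

Establish a baseline measurement of nondeterministic drift in LLMs under operator-free conditions.
Compare baseline drift across model sizes, deployment types, prompting modes, and temperatures.
Situate drift measurement within the existing literature on concept drift and infrastructure-induced nondeterminism.
Provide data and methodology to support future stabilisation research.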

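Proposed method

Evaluate two publicly accessible models: gpt-4o-mini served through an API and llama3.1-8b run locally.
Test five prompt categories under three prompting modes (exact repeats, perturbed inputs, and reuse) at two temperatures (0.0 and 0.7).
Run each combination 30 times for the gap-fill prompts and 20 times for the small-battery prompts.
Measure drift with the unique output fraction, the mean pairwise Jaccard similarity, and word count statistics.
Discuss the limitations of lexical drift metrics and propose semantic metrics as future work.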
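
The three lexical measures listed above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names (`tokens`, `jaccard`, `drift_metrics`), the lowercase whitespace tokenisation, and the definition of the unique output fraction as distinct strings divided by number of runs are all assumptions, since this review does not give those details.

```python
from itertools import combinations
from statistics import mean, pstdev


def tokens(text):
    # Assumption: lowercase, whitespace-split word sets; the paper's
    # tokenisation is not described in this review.
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between the word sets of two outputs."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty outputs are treated as identical
    return len(ta & tb) / len(ta | tb)


def drift_metrics(outputs):
    """Summarise the repeated-run outputs of one prompt and configuration."""
    if not outputs:
        raise ValueError("need at least one output")
    pairs = list(combinations(outputs, 2))
    word_counts = [len(o.split()) for o in outputs]
    return {
        # Assumption: distinct output strings divided by number of runs.
        "unique_fraction": len(set(outputs)) / len(outputs),
        "mean_jaccard": mean(jaccard(a, b) for a, b in pairs) if pairs else 1.0,
        "word_count_mean": mean(word_counts),
        "word_count_std": pstdev(word_counts),
    }


# Toy outputs, invented for illustration (not data from the paper).
runs = [
    "The cat sat on the mat.",
    "The cat sat on the mat.",
    "A cat sat on the mat.",
    "The cat is on the mat.",
]
print(drift_metrics(runs))
```

In a real experiment, `runs` would hold the 20 or 30 responses collected for one prompt under one model, mode, and temperature, and the per-prompt values would then be averaged to obtain figures comparable to the reported means.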

Figure 1: Mean unique output fraction for exact repeats at temperature 0.0

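Experimental results

Research questions

RQ1: What is the magnitude of baseline behavioural drift when prompts are issued repeatedly without intervention?
RQ2: How does deployment type (API-served vs. local open-weight) affect baseline drift?
RQ3: How do prompting mode (exact repeats, perturbed inputs, reuse) and temperature setting affect drift across prompt categories?
RQ4: What are the limitations of lexical metrics for measuring drift, and how can semantic metrics improve the evaluation?
RQ5: How can drift be interpreted through variance budgets and attractor regions, and how can that interpretation inform mitigation thresholds?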

Key findings

Model	Temperature	Mode	Mean unique fraction	Mean Jaccard
gpt-4o-mini	0.0	exact	0.240	0.893
gpt-4o-mini	0.0	perturb	0.572	0.632
gpt-4o-mini	0.0	reuse	0.200	0.971
gpt-4o-mini	0.7	exact	0.987	0.518
gpt-4o-mini	0.7	perturb	0.000	0.440
gpt-4o-mini	0.7	reuse	1.000	0.706
llama3.1-8b	0.0	exact	0.093	0.966
llama3.1-8b	0.0	perturb	0.274	0.789
llama3.1-8b	0.0	reuse	0.100	0.910
llama3.1-8b	0.7	exact	0.987	0.471
llama3.1-8b	0.7	perturb	1.000	0.403
llama3.1-8b	0.7	reuse	0.973	0.632

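Baseline drift is present even at temperature 0.0: under exact repeats, the unique output fraction is about 0.24 for gpt-4o-mini and about 0.09 for llama3.1-8b.
At temperature 0.0, perturbation increases drift (unique output fraction of about 0.57 for gpt-4o-mini and about 0.27 for llama3.1-8b), whereas reuse mode stays near the exact-repeat level (0.20 and 0.10).
Raising the temperature to 0.7 yields a new output on almost every run in most model-mode combinations, and lexical similarity falls sharply: mean Jaccard is below 0.52 for exact and perturbed prompts (0.40 to 0.52), while reuse mode stays higher (0.706 and 0.632).
Comparing models at temperature 0.0 with exact repeats, the mean unique fraction is 0.240 for gpt-4o-mini versus 0.093 for llama3.1-8b, and the mean Jaccard similarity is 0.893 versus 0.966.
The magnitude of drift depends on model size, deployment, and prompting mode, and lexical metrics are limited in their ability to capture semantic drift.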
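
The limitation of lexical metrics noted in the last finding can be illustrated with a small sketch. The sentences below are invented for illustration and are not taken from the paper, and the whitespace-based `jaccard` helper is an assumed stand-in for whatever lexical similarity the authors computed. The point is only that word overlap and meaning can diverge in both directions.

```python
def jaccard(a, b):
    # Assumption: word-set Jaccard with lowercase whitespace tokenisation.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Same meaning, different wording: lexical similarity is low.
paraphrase = jaccard(
    "the meeting was postponed until friday",
    "they pushed the session back to the end of the week",
)

# Opposite meaning, near-identical wording: lexical similarity is high.
negation = jaccard(
    "the drug is safe for children",
    "the drug is not safe for children",
)

print(f"paraphrase pair: {paraphrase:.3f}")
print(f"negation pair:   {negation:.3f}")
```

A purely lexical score would flag the paraphrase pair as heavy drift and the negation pair as nearly stable, which is the reverse of what matters semantically. This is the kind of gap that semantic metrics are meant to close.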

Figure 2: Mean average Jaccard similarity for exact repeats at temperature 0.0


This review was generated by AI and checked by a human editor.