QUICK REVIEW

[Paper Review] When Stability Fails: Hidden Failure Modes of LLMs in Data-Constrained Scientific Decision-Making

Nazia Riasat|arXiv (Cornell University)|Mar 16, 2026

Scientific Computing and Data Management | Cited by: 0

One-Sentence Summary

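This paper proposes a controlled behavioral evaluation that separates stability, correctness, prompt sensitivity, and output validity in LLM-based, data-constrained scientific decision tasks, and shows that high stability guarantees neither agreement with ground truth nor valid output.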

ABSTRACT

Large language models (LLMs) are increasingly used as decision-support tools in data-constrained scientific workflows, where correctness and validity are critical. However, evaluation practices often emphasize stability or reproducibility across repeated runs. While these properties are desirable, stability alone does not guarantee agreement with statistical ground truth when such references are available. We introduce a controlled behavioral evaluation framework that explicitly separates four dimensions of LLM decision-making: stability, correctness, prompt sensitivity, and output validity under fixed statistical inputs. We evaluate multiple LLMs using a statistical gene prioritization task derived from differential expression analysis across prompt regimes involving strict and relaxed significance thresholds, borderline ranking scenarios, and minor wording variations. Our experiments show that LLMs can exhibit near-perfect run-to-run stability while systematically diverging from statistical ground truth, over-selecting under relaxed thresholds, responding sharply to minor prompt wording changes, or producing syntactically plausible gene identifiers absent from the input table. Although stability reflects robustness across repeated runs, it does not guarantee agreement with statistical ground truth in structured scientific decision tasks. These findings highlight the importance of explicit ground-truth validation and output validity checks when deploying LLMs in automated or semi-automated scientific workflows.

Research Motivation and Objectives

Motivate the need for evaluating LLMs in data-constrained scientific workflows beyond stability.
Introduce a controlled behavioral framework separating four decision-making dimensions: stability, correctness, prompt sensitivity, and output validity.
Use a fixed differential expression (DE) table as a ground-truth reference to compare LLM outputs.
Characterize common failure modes in statistical gene prioritization under varied thresholding and prompt wording.
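To make the idea of a deterministic ground-truth reference concrete, here is a minimal sketch of how reference gene sets might be derived from a fixed DE table. It assumes the table is a CSV export with a `padj` column (the adjusted p-value column name used by DESeq2 results) and applies the strict (FDR ≤ 0.05) and relaxed (0.05 < FDR ≤ 0.10) cutoffs described in this review. The gene names, the values, and the helper names `load_padj` and `reference_set` are hypothetical and are not taken from the paper's repository.

```python
import csv
import io

# Hypothetical miniature DE table; gene names and values are invented for illustration.
DE_TABLE_CSV = """gene,log2FoldChange,padj
GENE_A,2.1,0.001
GENE_B,-1.7,0.030
GENE_C,1.2,0.070
GENE_D,0.4,0.400
GENE_E,0.9,NA
"""


def load_padj(csv_text):
    """Return {gene: adjusted p-value}, skipping rows whose padj is NA."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["padj"] != "NA":
            table[row["gene"]] = float(row["padj"])
    return table


def reference_set(padj, upper, lower=None):
    """Genes with padj <= upper and, if lower is given, padj > lower."""
    return {
        gene
        for gene, p in padj.items()
        if p <= upper and (lower is None or p > lower)
    }


padj = load_padj(DE_TABLE_CSV)
strict_truth = reference_set(padj, upper=0.05)               # FDR <= 0.05
relaxed_band = reference_set(padj, upper=0.10, lower=0.05)   # 0.05 < FDR <= 0.10

print(sorted(strict_truth))  # ['GENE_A', 'GENE_B']
print(sorted(relaxed_band))  # ['GENE_C']
```

Because the reference sets are computed by a fixed rule from a fixed table, any disagreement between them and a model's selection can be attributed to the model rather than to the data.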

Proposed Method

Provide a fixed DESeq2-derived differential expression table as input and query multiple LLMs (ChatGPT, Gemini, Claude) across regimes.
Vary thresholds (strict FDR ≤ 0.05, relaxed 0.05 < FDR ≤ 0.10), borderline ranking, and minor prompt wording changes (P7a vs P7b).
Assess outputs via four metrics: run-to-run stability (Jaccard), agreement with ground truth (Jaccard vs truth), prompt sensitivity (differences across prompts), and output validity (presence of invalid gene identifiers).
Use deterministic prompts and 10 repeated runs per configuration to isolate model behavior from data variability.
Provide code and results in a supplementary repository for reproducibility.
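The metrics listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: run-to-run stability is taken here as the mean pairwise Jaccard similarity across repeated runs, agreement as the mean Jaccard similarity between each run and the reference set, and output validity as the set of returned identifiers that do not appear in the input table. All function names and the toy gene identifiers are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two gene selections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty selections agree
    return len(a & b) / len(a | b)


def run_to_run_stability(runs):
    """Mean pairwise Jaccard similarity across repeated runs (assumed definition)."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def mean_agreement(runs, truth):
    """Mean Jaccard similarity between each run and the reference set."""
    return sum(jaccard(run, truth) for run in runs) / len(runs)


def invalid_ids(runs, input_genes):
    """Identifiers returned by any run that are absent from the input table."""
    return set().union(*runs) - set(input_genes)


# Toy data: ten identical runs that miss GENE_B, add GENE_C, and invent GENE_X.
input_genes = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
truth = {"GENE_A", "GENE_B"}
runs = [{"GENE_A", "GENE_C", "GENE_X"} for _ in range(10)]

stability = run_to_run_stability(runs)
agreement = mean_agreement(runs, truth)
invalid = invalid_ids(runs, input_genes)

print(stability)  # 1.0
print(agreement)  # 0.25
print(invalid)    # {'GENE_X'}
```

The toy runs are perfectly stable (1.0) yet agree poorly with the reference (0.25) and contain an identifier that was never in the input, which is the kind of pattern the framework is designed to expose by reporting the dimensions separately.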

Experimental Results

Research Questions

RQ1: Does high run-to-run stability imply correctness with respect to a statistical ground truth?
RQ2: How do minor prompt wording changes affect LLM decision outputs under fixed inputs?
RQ3: What is the impact of relaxing statistical thresholds on LLM-based gene prioritization?
RQ4: Do LLMs generate invalid or hallucinated gene identifiers even with stable outputs?

Key Findings

LLMs can show near-perfect run-to-run stability while disagreeing with ground truth.
Small wording differences in prompts can markedly shift prioritization outcomes.
Relaxed statistical thresholds promote over-selection or collapse rather than reliable sensitivity improvements.
Models may produce syntactically plausible but invalid gene identifiers not present in the input, indicating output validity issues.
Stability reflects internal robustness but does not guarantee agreement with deterministic statistical references.
A four-dimensional evaluation framework is necessary to diagnose LLM behavior in data-constrained scientific workflows.
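One way to quantify the prompt-wording finding is to compare what a model typically selects under two prompt variants, such as P7a and P7b. The sketch below is an assumption rather than the paper's definition of prompt sensitivity: it reduces each prompt's repeated runs to a consensus set (genes chosen in more than half of the runs, an illustrative cutoff) and reports one minus the Jaccard similarity between the two consensus sets. The function names and toy data are hypothetical.

```python
from collections import Counter


def consensus_set(runs, min_fraction=0.5):
    """Genes selected in more than min_fraction of the runs (illustrative cutoff)."""
    counts = Counter(gene for run in runs for gene in set(run))
    return {gene for gene, c in counts.items() if c / len(runs) > min_fraction}


def prompt_shift(runs_a, runs_b):
    """1 - Jaccard similarity between the consensus sets of two prompt variants."""
    a, b = consensus_set(runs_a), consensus_set(runs_b)
    union = a | b
    if not union:
        return 0.0  # both prompts select nothing, so there is no shift
    return 1.0 - len(a & b) / len(union)


# Toy data: each prompt variant is perfectly stable on its own,
# but the two variants select mostly different genes.
runs_p7a = [{"GENE_A", "GENE_B"} for _ in range(10)]
runs_p7b = [{"GENE_A", "GENE_C", "GENE_D"} for _ in range(10)]

shift = prompt_shift(runs_p7a, runs_p7b)
print(shift)  # 0.75
```

A value near 0 means the wording change left the typical selection unchanged; a value near 1 means the two wordings produce almost disjoint selections even though the input table is identical.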

This review was generated by AI and reviewed by a human editor.