QUICK REVIEW

[Paper Review] Brittlebench: Quantifying LLM robustness via prompt sensitivity

Angelika Romanou, Mark Ibrahim|arXiv (Cornell University)|Feb 27, 2026

Natural Language Processing Techniques | Cited by 0

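One-Sentence Summary

TL;DR: Brittlebench introduces a variance decomposition framework that quantifies model brittleness by separating task difficulty from prompt-induced variation, and uses it to evaluate frontier open-weight and commercial large language models under semantics-preserving prompt perturbations.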

ABSTRACT

Existing evaluation methods largely rely on clean, static benchmarks, which can overestimate true model performance by failing to capture the noise and variability inherent in real-world user inputs. This is especially true for language models, which can face human-generated text queries containing mistakes, typos, or alternative ways of phrasing the same question. In this work, we introduce a theoretical framework for quantifying model sensitivity to prompt variants, or brittleness, that can enable us to disentangle data-induced difficulty from prompt-related variability. Using this framework, we design a novel evaluation pipeline, Brittlebench, to holistically evaluate the sensitivity of frontier models. We apply semantics-preserving perturbations to a suite of popular benchmarks, and observe model performance to degrade as much as 12%. However, these perturbations do not affect all models equally: even a single perturbation alters the relative ranking of models in 63% of cases, impacting conclusions about comparative model performance. Decomposing the total variance of both state-of-the-art open-weight and commercial models, we find that semantics-preserving input perturbations can account for up to half of the performance variance for a given model. Brittlebench highlights the need for more robust evaluations and models, and allows us to systematically understand model brittleness.

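Motivation and Objectives

Motivation: Static benchmarks can misrepresent a model's true robustness to noisy or varied prompts.
Goal: Quantify how much prompt phrasing contributes to performance variability (brittleness), separately from inherent task difficulty.
Aim: Develop a unified taxonomy of semantics-preserving perturbations and a meta-evaluation pipeline that measures model robustness across benchmarks and model families.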

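Proposed Method

Proposes a variance decomposition framework that splits the observed variance in accuracy into data difficulty (V_data) and perturbation sensitivity (V_brittleness).
Defines model-level and benchmark-level brittleness scores (Pi_m, Pi_b) as the fraction of total variance attributable to perturbations.
Builds a perturbation taxonomy (word-level operations, context augmentation, prompt padding, paraphrasing, math/code perturbations).
Applies semantics-preserving perturbations to existing benchmarks (MMLU, TruthfulQA, ARC, MathQA, GPQA, LogiQA) and evaluates frontier open-weight and commercial models (GPT-5, Claude 4.5 Opus, Llama3, Qwen3).
Controls for semantic drift with cosine-similarity checks, and evaluates open-weight models with log-probability-based scoring through an evaluation toolkit (lm-evaluation-harness) and commercial models through API prompting.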
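The variance split described above can be illustrated with the law of total variance. The sketch below is an assumption about how such a decomposition might be computed, not the paper's estimator: it takes a hypothetical matrix `scores[i][j]` (1.0 if the model answers item `i` correctly under prompt variant `j`, else 0.0), treats the variance of per-item mean accuracy as V_data, treats the average within-item variance across variants as V_brittleness, and reports their ratio as a brittleness score. The function name and the toy data are illustrative only.

```python
from statistics import fmean, pvariance


def decompose_variance(scores):
    """Split correctness variance into item-level and perturbation-level parts.

    scores[i][j] is 1.0 if item i was answered correctly under prompt
    variant j, else 0.0. Every item must have the same number of variants.
    Returns (v_data, v_brittleness, pi), where pi is the share of total
    variance explained by prompt variants (an assumed definition).
    """
    # Between-item variance: how much mean accuracy differs across items.
    item_means = [fmean(row) for row in scores]
    v_data = pvariance(item_means)

    # Within-item variance: how much correctness flips across variants
    # of the same item, averaged over items.
    v_brittleness = fmean([pvariance(row) for row in scores])

    total = v_data + v_brittleness
    pi = v_brittleness / total if total > 0 else 0.0
    return v_data, v_brittleness, pi


# Toy data (made up): three items, four prompt variants each.
toy_scores = [
    [1.0, 1.0, 1.0, 1.0],  # always correct: easy and stable
    [1.0, 0.0, 1.0, 0.0],  # flips with the prompt: brittle
    [0.0, 0.0, 0.0, 0.0],  # always wrong: hard and stable
]
v_data, v_brittleness, pi = decompose_variance(toy_scores)
```

With equal-sized rows and population variances, the two components sum exactly to the variance of all entries taken together, which is what makes the ratio interpretable as a fraction of total variance.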

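Experimental Results

Research Questions

RQ1: How much of the performance variability observed on standard benchmarks comes from prompt perturbations rather than intrinsic task difficulty?
RQ2: Do semantics-preserving perturbations systematically reduce performance, and does the size of the drop vary with model scale, task, or prompting strategy (zero-shot vs. few-shot)?
RQ3: How do perturbation type and intensity affect the rankings and robustness of frontier and open-weight models?
RQ4: Can test-time strategies mitigate brittleness, and how does chain-of-thought reasoning interact with robustness under perturbation?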

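Key Findings

Semantics-preserving perturbations reduce performance across models and benchmarks; surface-form changes cause the largest drops in some settings (around 12%).
Prompt perturbations change the relative ranking of open-weight models in 63% of cases, and the rank shifts depend on the perturbation type.
Perturbation-induced variance accounts for nearly half of the total variance for many open-weight models, indicating that robustness to input variation is a distinct dimension of model behavior.
Prompt padding and word-level perturbations amplify brittleness in the few-shot setting, whereas LLM-generated paraphrases are comparatively less harmful.
Compound perturbations (several perturbations applied together) usually cause larger drops, sometimes as high as about 45% in a single evaluation, showing non-additive effects.
Chain-of-thought reasoning improves accuracy but only modestly mitigates brittleness under perturbation.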
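The ranking finding above can be made concrete with a small check of whether a perturbation reorders models. This is a minimal sketch under stated assumptions: the model names, perturbation names, and accuracy values are invented for illustration, and counting a "case" as one perturbation whose full ranking differs from the clean ranking is an assumed definition that may not match how the paper counts cases.

```python
def ranking(accuracy_by_model):
    """Return model names sorted from best to worst accuracy.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    return sorted(accuracy_by_model, key=lambda m: (-accuracy_by_model[m], m))


def fraction_of_rank_changes(clean, perturbed_runs):
    """Share of perturbations whose model ranking differs from the clean one.

    clean: {model_name: accuracy} on the unperturbed benchmark.
    perturbed_runs: {perturbation_name: {model_name: accuracy}}.
    """
    baseline = ranking(clean)
    changed = sum(ranking(run) != baseline for run in perturbed_runs.values())
    return changed / len(perturbed_runs)


# Hypothetical numbers, for illustration only.
clean_acc = {"model_a": 0.80, "model_b": 0.78, "model_c": 0.60}
perturbed_acc = {
    "typos": {"model_a": 0.74, "model_b": 0.76, "model_c": 0.58},    # a and b swap
    "padding": {"model_a": 0.79, "model_b": 0.75, "model_c": 0.59},  # order kept
}
share_changed = fraction_of_rank_changes(clean_acc, perturbed_acc)
```

In this toy example both perturbations lower every model's accuracy, but only one of them changes which model comes out on top, which is the kind of effect that average-drop numbers alone would hide.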

This review was generated by AI and reviewed by a human editor.