QUICK REVIEW

[Paper Review] Brittlebench: Quantifying LLM robustness via prompt sensitivity

Angelika Romanou, Mark Ibrahim | arXiv (Cornell University) | 2026-02-27

Natural Language Processing Techniques | Citations: 0

One-line summary

Brittlebench quantifies model brittleness through a framework that separates prompt-induced variability from task difficulty, and evaluates frontier and commercial LLMs under semantics-preserving prompt perturbations.

ABSTRACT

Existing evaluation methods largely rely on clean, static benchmarks, which can overestimate true model performance by failing to capture the noise and variability inherent in real-world user inputs. This is especially true for language models, which can face human-generated text queries containing mistakes, typos, or alternative ways of phrasing the same question. In this work, we introduce a theoretical framework for quantifying model sensitivity to prompt variants, or brittleness, that can enable us to disentangle data-induced difficulty from prompt-related variability. Using this framework, we design a novel evaluation pipeline, Brittlebench, to holistically evaluate the sensitivity of frontier models. We apply semantics-preserving perturbations to a suite of popular benchmarks, and observe model performance to degrade as much as 12%. However, these perturbations do not affect all models equally: even a single perturbation alters the relative ranking of models in 63% of cases, impacting conclusions about comparative model performance. Decomposing the total variance of both state-of-the-art open-weight and commercial models, we find that semantics-preserving input perturbations can account for up to half of the performance variance for a given model. Brittlebench highlights the need for more robust evaluations and models, and allows us to systematically understand model brittleness.

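Motivation and goals

Motivation: Static benchmarks may not reflect how robust a model is to realistic noisy or varied prompts.
Goal: Quantify how much prompt format contributes to performance variability (brittleness), separately from intrinsic task difficulty.
Goal: Develop a unified taxonomy of semantics-preserving perturbations and a meta-evaluation pipeline that measures model robustness across benchmarks and model families.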

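Proposed method

Proposes a variance decomposition framework that splits the observed accuracy variance into data difficulty (V_data) and perturbation sensitivity (V_brittleness).
Defines model-level and benchmark-level brittleness scores (Pi_m, Pi_b) as the fraction of total variance attributable to perturbations.
Builds a taxonomy of perturbations: word manipulation, context augmentation, prompt padding, paraphrasing, and math/code variants.
Applies semantics-preserving perturbations to existing benchmarks (MMLU, TruthfulQA, ARC, MathQA, GPQA, LogiQA) and evaluates frontier open-weight and commercial models (GPT-5, Claude 4.5 Opus, Llama3, Qwen3).
Uses an evaluation harness that controls for meaning with a cosine similarity check, scoring open-weight models by log probabilities and commercial models through API prompting.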
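
The split into V_data and V_brittleness can be illustrated with the law of total variance. The sketch below is one plausible reading, not the paper's estimator: it assumes a per-item, per-variant correctness matrix, treats the variance of per-item mean accuracy as V_data, and treats the mean within-item variance across prompt variants as V_brittleness. The function name `decompose_variance` and the toy scores are hypothetical.

```python
from statistics import mean, pvariance


def decompose_variance(scores):
    """Split accuracy variance into a data part and a perturbation part.

    scores[i][j] is the correctness (0 or 1) of one model on item i under
    prompt variant j. Every item is assumed to have the same number of
    variants, so the two parts sum to the variance of all scores.
    """
    # Between-item variance: how much items differ in average difficulty.
    item_means = [mean(row) for row in scores]
    v_data = pvariance(item_means)

    # Mean within-item variance: how much the outcome on a fixed item
    # changes when only the prompt variant changes.
    v_brittleness = mean([pvariance(row) for row in scores])

    total = v_data + v_brittleness
    # Brittleness score: share of total variance due to perturbations.
    pi = v_brittleness / total if total > 0 else 0.0
    return v_data, v_brittleness, pi


# Hypothetical toy data: 3 items, 4 prompt variants each.
toy_scores = [
    [1, 1, 1, 1],  # always solved, no sensitivity
    [1, 0, 1, 0],  # outcome flips with the prompt variant
    [0, 0, 0, 0],  # never solved, no sensitivity
]
v_data, v_brittleness, pi_m = decompose_variance(toy_scores)
print(f"V_data={v_data:.3f} V_brittleness={v_brittleness:.3f} Pi_m={pi_m:.3f}")
```

In this toy case only the second item reacts to the prompt variant, so one third of the total variance is attributed to perturbations. A benchmark-level score (Pi_b) could presumably be built the same way by pooling over models, but that aggregation is an assumption here.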

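Experimental results

Research questions

RQ1: How much of the observed variability in model performance on standard benchmarks is attributable to prompt perturbations rather than intrinsic task difficulty?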

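Key findings

Semantics-preserving perturbations degrade performance across many models and benchmarks, and surface-form changes often cause the largest drops (up to about 12% in some settings).
Prompt perturbations change the model ranking in 63% of cases for open-weight models, and the rank changes depend on the perturbation type.
Perturbation-induced variance accounts for up to about half of the total variance for many open-weight models, suggesting that robustness to input variation is an independent axis of model behavior.
Prompt padding and word-level perturbations amplify brittleness in short-form settings, while LLM-generated paraphrases are comparatively less harmful.
Compound perturbations (several perturbations combined) often cause larger degradation, showing non-additive effects with drops of up to about 45% in a single evaluation case.
Chain-of-thought raises accuracy but offers only limited mitigation of brittleness under perturbation.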
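
The ranking finding can be made concrete with a small calculation. The sketch below assumes that a "case" is one perturbation applied to one benchmark and counts a case as changed when the accuracy-ordered list of models differs from the clean ordering. The review does not spell out the paper's definition, so this is only an illustration; the model names and accuracies are made up.

```python
def ranking(accuracy_by_model):
    """Return model names ordered from best to worst accuracy.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    return sorted(accuracy_by_model, key=lambda m: (-accuracy_by_model[m], m))


def rank_flip_rate(clean, perturbed):
    """Fraction of perturbations whose model ranking differs from the clean one.

    clean maps model name to accuracy on the unperturbed benchmark.
    perturbed maps perturbation name to a dict of the same shape as clean.
    """
    baseline = ranking(clean)
    changed = sum(1 for acc in perturbed.values() if ranking(acc) != baseline)
    return changed / len(perturbed)


# Hypothetical accuracies, for illustration only.
clean_acc = {"model_a": 0.80, "model_b": 0.75, "model_c": 0.60}
perturbed_acc = {
    "typos": {"model_a": 0.70, "model_b": 0.72, "model_c": 0.58},    # a and b swap
    "padding": {"model_a": 0.78, "model_b": 0.74, "model_c": 0.59},  # order unchanged
}
print(rank_flip_rate(clean_acc, perturbed_acc))
```

Comparing full orderings is the strictest choice: any swap counts as a change. A rank correlation such as Kendall's tau would instead measure how large the reordering is.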


This review was generated by AI and reviewed by a human editor.