QUICK REVIEW

[Paper Review] Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations

Evan Miller|arXiv (Cornell University)|Nov 1, 2024

Natural Language Processing Techniques|Cited by 6

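One-Sentence Summary

tldr: Introduces a formal statistical framework for language model evaluation, advocating standard errors, confidence intervals, and paired/clustered analysis to quantify evaluation noise and compare models. Provides practical guidance on variance reduction and on power analysis for evaluation design.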

ABSTRACT

Evaluations are critical for understanding the capabilities of large language models (LLMs). Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning. This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations. Conceptualizing evaluation questions as having been drawn from an unseen super-population, we present formulas for analyzing evaluation data, measuring differences between two models, and planning an evaluation experiment. We make a number of specific recommendations for running language model evaluations and reporting experiment results in a way that minimizes statistical noise and maximizes informativeness.

Motivation and Objectives

Frame eval questions as draws from an unseen super-population to study the underlying skill an eval measures.
Provide formulas and practical recommendations for computing standard errors and confidence intervals in evals.
Develop methods for comparing two models using unpaired and paired analyses, including clustered standard errors.
Offer variance-reduction strategies and a power analysis framework to guide experiment design and reporting.

Proposed Method

Model evaluation scores are decomposed into a conditional mean and a zero-mean random component.
Use the Central Limit Theorem to estimate standard errors of the mean, and report the SE alongside each mean.
Introduce clustered standard errors to handle non-independent questions within clusters.
Propose next-token probability analysis as a variance-reduction technique when available.
Derive paired-differences standard errors to exploit correlation when comparing two models on the same questions.
Provide a power analysis formula for required sample size given desired detectability.
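
The standard-error steps listed above can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's reference implementation: the scores and cluster labels are made up, the function names are hypothetical, and the clustered estimator shown is the standard cluster-robust form without a finite-sample correction.

```python
import math
from collections import defaultdict


def mean_and_se(scores):
    """Mean and CLT-based standard error, SE = sqrt(Var(s) / n)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)


def clustered_se(scores, cluster_ids):
    """Cluster-robust SE of the mean (assumed form, no small-sample correction).

    Residuals are summed within each cluster, the cluster sums are squared
    and added, and the square root of the total is divided by n.
    """
    n = len(scores)
    mean = sum(scores) / n
    cluster_sums = defaultdict(float)
    for s, c in zip(scores, cluster_ids):
        cluster_sums[c] += s - mean
    return math.sqrt(sum(v ** 2 for v in cluster_sums.values())) / n


# Hypothetical data: 12 questions scored 0/1, drawn in 4 clusters of 3
# (for example, several questions about the same passage).
scores = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1]
clusters = ["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]

mean, naive = mean_and_se(scores)
clustered = clustered_se(scores, clusters)
ci_low, ci_high = mean - 1.96 * clustered, mean + 1.96 * clustered

print(f"mean={mean:.3f} naive SE={naive:.3f} clustered SE={clustered:.3f}")
print(f"95% CI (clustered): [{ci_low:.3f}, {ci_high:.3f}]")
```

Because scores within a cluster move together in this toy data, the clustered SE comes out larger than the naive one, which is the effect the clustering correction is meant to capture.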

Experimental Results

Research Questions

RQ1: How should eval results be analyzed to reflect uncertainty about the true super-population mean?
RQ2: How can standard errors and confidence intervals be correctly computed for eval scores under independent and clustered question sampling?
RQ3: How should model comparisons be performed to maximize statistical power (unpaired vs. paired, clustered)?
RQ4: What strategies minimize eval variance (resampling, next-token probabilities) without biasing results?
RQ5: What sample size and minimum detectable effect (MDE) are needed to reliably detect model differences?
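
To make the unpaired vs. paired question concrete, here is a small Python sketch. The per-question scores for the two models are invented for illustration and the function names are hypothetical; the only point is that when both models are scored on the same questions, the standard error of the mean difference can be computed from per-question differences instead of from the two samples treated as independent.

```python
import math


def sample_variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def unpaired_se(a, b):
    """SE of the mean difference, treating the two score lists as independent."""
    return math.sqrt(sample_variance(a) / len(a) + sample_variance(b) / len(b))


def paired_se(a, b):
    """SE of the mean difference computed from per-question differences."""
    if len(a) != len(b):
        raise ValueError("paired analysis needs scores on the same questions")
    diffs = [x - y for x, y in zip(a, b)]
    return math.sqrt(sample_variance(diffs) / len(diffs))


# Hypothetical 0/1 scores of two models on the same 10 questions.
model_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
model_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]

diff = sum(model_a) / len(model_a) - sum(model_b) / len(model_b)
se_unpaired = unpaired_se(model_a, model_b)
se_paired = paired_se(model_a, model_b)

print(f"mean difference = {diff:.3f}")
print(f"unpaired SE = {se_unpaired:.3f}, paired SE = {se_paired:.3f}")
```

In this toy data the two models tend to succeed and fail on the same questions, so the paired SE is smaller than the unpaired one. With weakly or negatively correlated scores the advantage would shrink or disappear.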

Key Findings

Standard errors of the mean should be reported for eval scores, calculated via the Central Limit Theorem ($\mathrm{SE} = \sqrt{\mathrm{Var}(s)/n}$).
Clustered standard errors are necessary when questions are drawn in related groups, and can be substantially larger (e.g., up to 3x) than naive SEs.
Pairwise (paired) analysis reduces variance when comparing two models on the same questions, leveraging correlation between models on each question.
Next-token probabilities can further reduce conditional variance by replacing generated answers with probabilities, when available.
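A power analysis and a sample-size formula are provided to determine the number of questions needed to detect a given effect size with specified alpha and beta: $n = (z_{\alpha/2} + z_{\beta})^2 \, (\omega^2 + \sigma_A^2/K_A + \sigma_B^2/K_B) / \delta^2$, where $\delta$ is the effect size to detect, $K_A$ and $K_B$ are the numbers of samples drawn per question from each model, $\sigma_A^2$ and $\sigma_B^2$ are the conditional (within-question) variances, and $\omega^2$ is the variance across questions of the difference in conditional means.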
The paper argues that reported confidence intervals in some real evals may be anti-conservative (too narrow) because they ignore clustering and variance structure.
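
The sample-size formula quoted in the findings can be evaluated directly. The sketch below is an illustration only: the function name, the default alpha and power, and all numeric inputs are assumptions chosen for the example, and z_beta is taken as the standard normal quantile at the desired power (1 - beta).

```python
import math
from statistics import NormalDist


def required_questions(delta, omega_sq, sigma_a_sq, sigma_b_sq,
                       k_a=1, k_b=1, alpha=0.05, power=0.8):
    """Number of questions n from
    n = (z_{alpha/2} + z_beta)^2 (omega^2 + sigma_A^2/K_A + sigma_B^2/K_B) / delta^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile at 1 - beta
    var_term = omega_sq + sigma_a_sq / k_a + sigma_b_sq / k_b
    return math.ceil((z_alpha + z_beta) ** 2 * var_term / delta ** 2)


# Hypothetical inputs: detect a 3-point difference (delta = 0.03).
n_single = required_questions(delta=0.03, omega_sq=0.1,
                              sigma_a_sq=0.05, sigma_b_sq=0.05)

# Same setting, but with 10 samples per question from each model.
n_resampled = required_questions(delta=0.03, omega_sq=0.1,
                                 sigma_a_sq=0.05, sigma_b_sq=0.05,
                                 k_a=10, k_b=10)

print(f"questions needed, 1 sample per question:   {n_single}")
print(f"questions needed, 10 samples per question: {n_resampled}")
```

Resampling shrinks only the sigma terms, so the omega^2 term sets a floor on how far extra samples per question can reduce the required number of questions.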


This review was generated by AI and reviewed by human editors.