QUICK REVIEW

[Paper Review] Fault-Tolerant Evaluation for Sample-Efficient Model Performance Estimators

Zihan Zhu, Yanqiu Wu|arXiv (Cornell University)|Feb 6, 2026

Software System Performance and Reliability | Citations: 0

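One-Line Summary

Summary: This study proposes FT-Eval, a fault-tolerant framework for evaluating sample-efficient model performance estimators. It builds in an adjustable tolerance that accounts for bias and variance jointly, and it selects that tolerance automatically. The authors show that traditional metrics conflict with one another, whereas FT-Eval yields consistent, actionable insights.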

ABSTRACT

In the era of Model-as-a-Service, organizations increasingly rely on third-party AI models for rapid deployment. However, the dynamic nature of emerging AI applications, the continual introduction of new datasets, and the growing number of models claiming superior performance make efficient and reliable validation of model services increasingly challenging. This motivates the development of sample-efficient performance estimators, which aim to estimate model performance by strategically selecting instances for labeling, thereby reducing annotation cost. Yet existing evaluation approaches often fail in low-variance settings: RMSE conflates bias and variance, masking persistent bias when variance is small, while p-value based tests become hypersensitive, rejecting adequate estimators for negligible deviations. To address this, we propose a fault-tolerant evaluation framework that integrates bias and variance considerations within an adjustable tolerance level ${\varepsilon}$, enabling the evaluation of performance estimators within practically acceptable error margins. We theoretically show that proper calibration of ${\varepsilon}$ ensures reliable evaluation across different variance regimes, and we further propose an algorithm that automatically optimizes and selects ${\varepsilon}$. Experiments on real-world datasets demonstrate that our framework provides comprehensive and actionable insights into estimator behavior.

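Motivation and Objectives

Identify the limitations of RMSE and p-values when evaluating sample-efficient performance estimators under varying variance.
Propose a fault-tolerant evaluation framework that assesses estimator quality against a user-defined tolerance.
Develop an algorithm that automatically selects and optimizes the tolerance (epsilon) across labeling budgets.
Validate the framework on diverse datasets and models, showing better interpretability and reliability than traditional metrics.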
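
The first objective can be made concrete with two standard identities. This is a sketch under assumed notation that may differ from the paper's: $N$ runs with estimates $\hat\theta_i$, ground truth $\theta^{*}$, mean estimate $\bar\theta$, $\mathrm{Bias} = \bar\theta - \theta^{*}$, and $\mathrm{Var}$ the $1/N$ variance of the runs.

$$
\mathrm{RMSE}^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\hat\theta_i - \theta^{*}\right)^2 = \mathrm{Bias}^2 + \mathrm{Var},
\qquad
t = \frac{\mathrm{Bias}}{\sqrt{\mathrm{Var}/(N-1)}}
$$

RMSE reports only the sum, so a persistent bias and an equal amount of spread look the same. The one-sample $t$ statistic grows without bound as $\mathrm{Var}$ shrinks for any nonzero bias, so a p-value test rejects even negligible deviations in low-variance settings.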

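Proposed Method

Introduces FT-Eval, which uses two one-sided t-tests (TOST) to check whether the estimator's output falls within [theta* - epsilon, theta* + epsilon].
Defines the fault-tolerant null hypotheses H0^(L) and H0^(U) from the lower and upper bounds set by epsilon.
Computes the p-values p^(L) and p^(U) from the t-distribution; FT-Eval succeeds only if both null hypotheses are rejected.
Derives |Bias| + t_{alpha, N-1} * sqrt(Var/(N-1)) < epsilon as the core tolerance-calibration condition.
Proposes Algorithm 1, which automatically selects a discrimination margin delta* and uses binary search together with Run(E,N,k) evaluations to adjust epsilon dynamically for each budget.
Extends the AT and RS estimators to both classifier and LLM settings, with appropriate priors and partitioning strategies.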
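
The TOST check and the tolerance condition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`t_sf`, `t_crit`, `ft_eval`, `tolerance_needed`) and the run values are hypothetical, Var is assumed to be the 1/N variance across runs, and the Student-t tail is integrated numerically so the block needs only the standard library. Only theta* = 0.695 is taken from the Figure 2 caption.

```python
import math
from statistics import mean, pvariance


def t_sf(x, df):
    """Upper-tail probability P(T > x) for Student's t (Simpson's rule).

    Standard-library stand-in; scipy.stats.t.sf(x, df) computes the same quantity.
    """
    coef = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    coef /= math.sqrt(df * math.pi)

    def pdf(u):
        return coef * (1 + u * u / df) ** (-(df + 1) / 2)

    a = abs(x)
    steps = 2 * max(500, math.ceil(50 * a))  # even count, step size <= 0.01
    h = a / steps
    area = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    upper = max(0.0, 0.5 - area * h / 3)
    return upper if x >= 0 else 1.0 - upper


def t_crit(alpha, df):
    """One-sided critical value t_{alpha, df}, found by bisection on t_sf."""
    lo, hi = 0.0, 1.0
    while t_sf(hi, df) > alpha:
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def ft_eval(estimates, theta_star, eps, alpha=0.05):
    """TOST check that the mean estimate lies within theta_star +/- eps.

    Assumes at least two runs with non-zero variance.
    """
    n = len(estimates)
    bias = mean(estimates) - theta_star
    se = math.sqrt(pvariance(estimates) / (n - 1))  # sqrt(Var / (N - 1))
    p_lower = t_sf((bias + eps) / se, n - 1)  # H0^(L): mean <= theta* - eps
    p_upper = t_sf((eps - bias) / se, n - 1)  # H0^(U): mean >= theta* + eps
    return p_lower, p_upper, max(p_lower, p_upper) < alpha


def tolerance_needed(estimates, theta_star, alpha=0.05):
    """Left-hand side of the condition: |Bias| + t_{alpha,N-1} * sqrt(Var/(N-1))."""
    n = len(estimates)
    bias = mean(estimates) - theta_star
    se = math.sqrt(pvariance(estimates) / (n - 1))
    return abs(bias) + t_crit(alpha, n - 1) * se


# Made-up estimates from eight hypothetical runs (not data from the paper).
runs = [0.690, 0.693, 0.696, 0.698, 0.701, 0.694, 0.697, 0.692]
theta_star = 0.695
needed = tolerance_needed(runs, theta_star)
for eps in (0.01, 0.002):
    p_lo, p_up, ok = ft_eval(runs, theta_star, eps)
    print(f"eps={eps}: p_L={p_lo:.4f}, p_U={p_up:.4f}, "
          f"pass={ok}, closed_form={needed < eps}")
```

Both one-sided tests reject exactly when each t statistic exceeds the critical value, which rearranges to the closed-form condition, so the `pass` and `closed_form` columns agree. Under these assumptions, `tolerance_needed` also gives the smallest epsilon at which a set of runs would pass.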

Figure 1: An overview of the evaluation challenge for sample-efficient model performance estimators. (a) AI models accessed via web APIs support various applications and users. (b) Performance estimators (e.g., Active Testing or Random Sampling) query and label task samples within a labeling budge

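Experimental Results

Research Questions

RQ1: How can RMSE and p-values misrepresent estimator quality under low-variance conditions?
RQ2: Can a tolerance-based framework account for bias and variance jointly and produce reliable estimator evaluations?
RQ3: Can an automated procedure effectively select an epsilon that discriminates between estimators across labeling budgets?
RQ4: Compared with traditional metrics, does FT-Eval give consistent, actionable insights across diverse datasets and models spanning vision, NLP, and LLMs?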

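Key Findings

The traditional metrics (RMSE and p-value) contradict each other in 73.3% of the tested settings, which calls their reliability into question.
FT-Eval, with its adjustable tolerance, provides consistent and interpretable evaluations across budgets and variance regimes.
A dynamic, budget-aware epsilon helps identify practically meaningful differences between estimators (e.g., AT vs. RS).
Experiments across multiple datasets and models show that FT-Eval yields actionable insights beyond what traditional metrics provide.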
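
The kind of disagreement described in the first finding can be reproduced on synthetic numbers. The sketch below is illustrative only: the two sets of runs are constructed, not taken from the paper, the helper names are hypothetical, and a two-sided z-test (normal approximation) stands in for a t-test so that the block needs only the standard library.

```python
import math
from statistics import mean, pstdev


def make_runs(center, spread, n):
    """Deterministic synthetic runs with mean `center` and 1/N std `spread`."""
    grid = [2 * i / (n - 1) - 1 for i in range(n)]  # symmetric around 0
    scale = math.sqrt(sum(g * g for g in grid) / n)
    return [center + spread * g / scale for g in grid]


def rmse(estimates, theta_star):
    """Root mean squared error of the runs against the ground truth."""
    return math.sqrt(mean((e - theta_star) ** 2 for e in estimates))


def z_test_p(estimates, theta_star):
    """Two-sided p-value for H0: mean == theta_star (normal approximation)."""
    n = len(estimates)
    se = pstdev(estimates) / math.sqrt(n - 1)
    z = (mean(estimates) - theta_star) / se
    return math.erfc(abs(z) / math.sqrt(2))


theta_star = 0.695  # ground-truth accuracy quoted in the Figure 2 caption
# Hypothetical estimator A: small persistent bias, very low variance.
biased_tight = make_runs(theta_star + 0.002, 0.0005, 100)
# Hypothetical estimator B: no bias, much higher variance.
unbiased_noisy = make_runs(theta_star, 0.02, 100)

for name, runs in (("A", biased_tight), ("B", unbiased_noisy)):
    print(f"{name}: RMSE={rmse(runs, theta_star):.4f}, "
          f"p={z_test_p(runs, theta_star):.3g}")
```

In this constructed case RMSE favors A, because its total error is roughly ten times smaller, while the significance test rejects A and accepts B. Neither metric says whether a 0.002 bias matters in practice, which is the gap a tolerance such as epsilon is meant to fill.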

Figure 2: A comparison on two estimators, active testing (AT) and random sampling (RS), on 20 Newsgroup: (a) estimated performance (i.e., accuracy) with their mean and standard deviation across multiple runs, against the ground truth performance $\theta^{*}=0.695$ (the red dashed line); (b) RMSE,

This review was generated by AI and checked by a human editor.