QUICK REVIEW

[Paper Review] Towards a Science of AI Agent Reliability

Stephan Rabanser, Sayash Kapoor | arXiv (Cornell University) | Feb 18, 2026

Ethics and Social Impacts of AI | Cited by 2

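One-Sentence Summary

This paper proposes a multidimensional reliability framework for AI agents, grounded in safety-critical engineering, that decomposes reliability into consistency, robustness, predictability, and safety. It evaluates 14 models on two benchmarks and shows that reliability lags behind capability gains.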

ABSTRACT

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.

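Motivation and Goals

Define AI agent reliability along four dimensions (consistency, robustness, predictability, and safety) by drawing on principles from safety-critical engineering.
Propose a suite of 12 metrics that measure reliability independently of raw task accuracy.
Benchmark and analyze current AI agents, map where reliability lags behind capability gains, and identify priority research areas.
Provide a framework that helps practitioners reason about how agents perform, degrade, and fail, beyond accuracy alone.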

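Proposed Method

Adopt reliability concepts from aviation, nuclear power, automotive engineering, and process control, and decompose reliability into four dimensions.
Define 12 concrete, dimension-specific metrics that are independent of raw accuracy (Section 3).
Aggregate metrics within each dimension and report an overall reliability score, with transparent aggregation choices (equations and tables in Section 3).
Evaluate 14 models on two benchmarks (GAIA and τ-bench) using repeated runs, prompt paraphrasing, fault injection, environment perturbations, confidence estimation, and safety analysis (Section 4).
Decouple reliability from capability through normalization and ratio-based comparisons (Section 3.5.1).
Provide a detailed experimental protocol covering repeated evaluation (K=5), paraphrased prompts, fault injection, and safety analysis (Section 4.1).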
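
As a rough illustration of the evaluation recipe above, the following Python sketch computes an outcome-consistency score from K repeated runs, a ratio-based robustness score, and a simple two-level aggregate. The function names, the pairwise-agreement formula, the clipping to [0, 1], and the unweighted means are all assumptions made for illustration; the paper's actual metric definitions are given in its Section 3 and may differ.

```python
from itertools import combinations


def outcome_consistency(runs):
    """Mean pairwise agreement of success/failure outcomes across repeated runs.

    `runs` maps a task id to a list of K booleans (one per run, K >= 2).
    Hypothetical metric: 1.0 means every run of every task agreed.
    """
    per_task = []
    for outcomes in runs.values():
        pairs = list(combinations(outcomes, 2))
        if not pairs:
            raise ValueError("each task needs at least two runs")
        agreeing = sum(1 for a, b in pairs if a == b)
        per_task.append(agreeing / len(pairs))
    return sum(per_task) / len(per_task)


def robustness_ratio(baseline_accuracy, perturbed_accuracy):
    """Accuracy under perturbation relative to baseline, clipped to [0, 1].

    Using a ratio rather than raw perturbed accuracy is one way to keep the
    score from simply tracking capability (an assumption, not the paper's formula).
    """
    if baseline_accuracy <= 0:
        raise ValueError("ratio is undefined when baseline accuracy is zero")
    return max(0.0, min(1.0, perturbed_accuracy / baseline_accuracy))


def aggregate(dimension_metrics):
    """Unweighted mean within each dimension, then across dimensions.

    `dimension_metrics` maps a dimension name to a list of scores in [0, 1].
    Returns (per-dimension scores, overall score).
    """
    per_dimension = {
        name: sum(values) / len(values)
        for name, values in dimension_metrics.items()
    }
    overall = sum(per_dimension.values()) / len(per_dimension)
    return per_dimension, overall


# Toy data with K=5 runs per task (values are made up for illustration).
example_runs = {
    "task_a": [True, True, True, True, True],
    "task_b": [True, True, False, False, False],
}
consistency = outcome_consistency(example_runs)
robustness = robustness_ratio(baseline_accuracy=0.6, perturbed_accuracy=0.45)
per_dimension, overall = aggregate(
    {"consistency": [consistency], "robustness": [robustness]}
)
```

Note that a task solved in 2 of 5 runs contributes to accuracy but scores low on pairwise agreement, which is the kind of gap between capability and reliability that a single success metric hides.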

Figure 1 : Reliability gains lag behind capability progress. Overall reliability shows slow improvement over time. While accuracy rises steadily across both benchmarks (left), reliability trails behind (center), and the relationship between the two varies across benchmarks (right), indicating that a

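Experimental Results

Research Questions

RQ1: How can AI agent reliability be defined and measured beyond traditional accuracy metrics?
RQ2: What is the empirical state of reliability for current AI agents on standardized benchmarks?
RQ3: How do the reliability dimensions relate to model capability and release date?
RQ4: Which reliability dimensions should research prioritize to enable deployable AI agents?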

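Key Findings

Across model versions and benchmarks, reliability improvements lag behind capability gains.
Outcome consistency remains low; agents often fail to solve the same task consistently across repeated runs.
Prompt robustness varies across models; frontier models show modest improvement but are not universally robust to paraphrasing.
Calibration improves in newer models, but discrimination may decline on some benchmarks, notably GAIA.
Recent frontier models show lower violation rates, but the severity of harm when violations do occur remains non-negligible.
Consistency tends to be higher in smaller models, suggesting greater variability in larger models; reasoning models show mixed reliability gains.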
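
The finding that calibration can improve while discrimination declines is easier to see with the two quantities computed side by side. The sketch below uses textbook definitions (expected calibration error with equal-width bins, and AUROC as the probability that a successful task received higher confidence than a failed one). These are assumed stand-ins, not necessarily the predictability metrics the paper defines, and the function names are hypothetical.

```python
def expected_calibration_error(confidences, successes, n_bins=10):
    """Weighted mean gap between average confidence and success rate per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, successes):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def auroc(confidences, successes):
    """Probability that a success outranks a failure by confidence (ties count 0.5)."""
    positives = [c for c, ok in zip(confidences, successes) if ok]
    negatives = [c for c, ok in zip(confidences, successes) if not ok]
    if not positives or not negatives:
        raise ValueError("need at least one success and one failure")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in positives
        for q in negatives
    )
    return wins / (len(positives) * len(negatives))


# An agent that always reports 0.5 and succeeds half the time is perfectly
# calibrated on average, yet its confidence cannot separate successes from failures.
flat_conf = [0.5, 0.5, 0.5, 0.5]
outcomes = [True, False, True, False]
flat_ece = expected_calibration_error(flat_conf, outcomes)
flat_auroc = auroc(flat_conf, outcomes)
```

In this toy case the calibration error is zero while the AUROC sits at chance level, so a model can look well calibrated and still give no usable signal about which individual tasks it will fail.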

Figure 2 : Outcome consistency across models. Results show only modest consistency across the board; even current frontier models do not reliably improve across both benchmarks.

This review was generated by AI and reviewed by a human editor.