QUICK REVIEW

[Paper Review] Real-World AI Evaluation: How FRAME Generates Systematic Evidence to Resolve the Decision-Maker's Dilemma

Reva Schwartz, Gabriella Waters | arXiv (Cornell University) | Feb 28, 2026

Ethics and Social Impacts of AI | Cited by 0

One-Sentence Summary

FRAME proposes a real-world AI evaluation framework that combines large-scale testing with contextual observation to convert user entropy into actionable, deployment-focused evidence for decision-makers.

ABSTRACT

The rapid expansion of AI deployments has put organizational leaders in a decision-maker's dilemma: they must govern these technologies without systematic evidence of how systems behave in their own environments. Predominant evaluation methods generate scalable, abstract measures of model capabilities but smooth over the heterogeneity of real-world use, while user-focused testing reveals rich contextual detail yet remains small in scale and loosely coupled to the mechanisms that shape model behavior. The Forum for Real World AI Measurement and Evaluation (FRAME) addresses this gap by combining large-scale trials of AI systems with structured observation of how they are used in context, the outcomes they generate, and how those outcomes arise. By tracing the path from an AI system's output through its practical use and downstream effects, FRAME turns the heterogeneity of AI in use into a measurable signal rather than a trade-off for achieving scale. FRAME establishes two core assets to accomplish this: a Testing Sandbox that captures AI use under real workflows at scale and a Metrics Hub that translates those traces into actionable indicators.

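Research Motivation and Objectives

Fill the gap between model-centric benchmarks and deployment needs by capturing AI in real-world use.
Provide scalable, standardized infrastructure for observing what AI is used for, who uses it, and what outcomes result.
Translate observations into decision-ready indicators that are comparable across sites and sectors.
Complement existing benchmarks with deployment-oriented evidence that supports risk assessment and value realization.
Support leaders' sensemaking with structured, contextualized evidence rather than abstract scores.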

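Proposed Method

Develop a Testing Sandbox with two parallel streams: a remote participant panel and scripted chatbot runs.
Use common scenarios, logging, and scenario-specific rubrics to characterize use, non-use, reliance, and abandonment.
Use an LLM-as-judge to annotate both panel-participant and scripted outputs with the same rubrics, yielding parallel descriptive codes.
Build a Metrics Hub that translates sandbox results into six categories of indicators plus contextual insights.
Operate through a centralized methods lab within a distributed FRAME consortium to ensure speed, reproducibility, and cross-site comparability.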
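
To make the two-stream idea concrete, here is a minimal sketch of how parallel descriptive codes from the panel stream and the scripted stream might be compared once both have been labeled with the same rubric. Everything in it is an assumption for illustration: the code labels are taken from the behaviors named above (use, non-use, reliance, abandonment), while the `CodedTrace` record, the `code_shares` and `stream_gap` helpers, and the sample data are hypothetical and do not come from the paper.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical descriptive codes, borrowed from the behaviors named in the summary.
CODES = ("use", "non_use", "reliance", "abandonment")


@dataclass(frozen=True)
class CodedTrace:
    """One rubric-labeled interaction (illustrative structure, not FRAME's schema)."""
    stream: str    # "panel" or "scripted"
    scenario: str  # shared scenario identifier
    code: str      # one of CODES


def code_shares(traces, stream):
    """Return the share of each descriptive code within one stream."""
    counts = Counter(t.code for t in traces if t.stream == stream)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CODES}


def stream_gap(traces):
    """Panel share minus scripted share for every code."""
    panel = code_shares(traces, "panel")
    scripted = code_shares(traces, "scripted")
    return {c: panel[c] - scripted[c] for c in CODES}


# Made-up sample data: both streams run the same two scenarios.
traces = [
    CodedTrace("panel", "s1", "use"),
    CodedTrace("panel", "s1", "abandonment"),
    CodedTrace("panel", "s2", "reliance"),
    CodedTrace("panel", "s2", "abandonment"),
    CodedTrace("scripted", "s1", "use"),
    CodedTrace("scripted", "s1", "use"),
    CodedTrace("scripted", "s2", "reliance"),
    CodedTrace("scripted", "s2", "use"),
]

gap = stream_gap(traces)
for code, delta in gap.items():
    print(f"{code:12s} {delta:+.2f}")
```

A positive value means the code appears more often among panel participants than in scripted runs. In this made-up sample, abandonment shows up only in the panel stream, which is the kind of difference a scripted-only evaluation would not surface.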

Figure 2: The current evaluation ecosystem uses methods that mirror the traditional AI development lifecycle, often neglecting user entropy and suppressing the context needed to make sense of outcomes for decision-making. (Generative artificial intelligence was used to support the creation of this g

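Results

Research Questions

RQ1: How can real-world use be modeled at scale to reveal user entropy and higher-order effects across contexts?
RQ2: Which deployment-oriented indicators best describe AI utility, friction, risk, and value in the real world?
RQ3: In sandbox evaluations, how do automated (LLM-as-judge) and human-grounded descriptors align or diverge?
RQ4: How can evidence from sandbox trials be standardized to support deployment decisions across sectors?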
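
RQ3 asks how automated and human-grounded descriptors align or diverge. The summary does not say which agreement statistic FRAME uses, so the following is only a sketch of one common choice, Cohen's kappa, applied to hypothetical code labels. The function name `cohens_kappa` and the sample labels are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(human)
    # Observed agreement: fraction of items where both raters chose the same code.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if each rater labeled independently at their own base rates.
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(h_counts[c] * j_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four interactions.
human_codes = ["use", "use", "reliance", "abandonment"]
judge_codes = ["use", "use", "reliance", "use"]

kappa = cohens_kappa(human_codes, judge_codes)
print(f"kappa = {kappa:.3f}")
```

A value near 1 indicates that the LLM judge and the human coders agree well beyond chance, while a value near 0 indicates agreement no better than chance. The one disagreement in this sample is an abandonment the judge labeled as use.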

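Key Findings

FRAME combines panel-participant traces with scripted chatbot runs to expose the gap between real-world use and automated evaluation.
The dual-stream scoring engine outputs descriptive codes for comparing system behavior, identifying friction and value across contexts.
The Metrics Hub groups outputs into six categories of indicators to enable comparison across sectors and deployments.
Real-world evaluation provides a structured evidence layer that supports deployment decisions by linking use, outcomes, and their downstream effects.
FRAME positions user entropy as a core signal for sensemaking rather than statistical noise in model evaluation.
The sandbox infrastructure supports policy testing and systematic cross-site comparison without exposing proprietary data.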

Figure 3: An example of how three knowledge layers build up evidence across contexts to address the decision‑maker’s dilemma. (Generative artificial intelligence was used to support the creation of this graphic representing the authors’ own ideas, data, and words on this topic.)

This review was generated by AI and checked by human editors.