QUICK REVIEW

[Paper Review] BRIDGE: Predicting Human Task Completion Time From Model Performance

Fengyuan Liu, Jay Gala|arXiv (Cornell University)|Feb 6, 2026

Ethics and Social Impacts of AI|Cited by 0

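One-Sentence Summary

BRIDGE uses a 2PL IRT model to align model performance with human task completion time, which lets it predict human task length for new benchmarks and forecast frontier model capabilities without new human annotations.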

ABSTRACT

Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR's exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.

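Research Motivation and Objectives

Bridge the gap between benchmark scores and human-interpretable task difficulty by anchoring the latent difficulty learned from model responses to human completion time.
Jointly estimate task difficulty and model capability across multiple benchmarks with a two-parameter logistic IRT model.
Predict human completion time for new benchmarks from model performance data alone.
Forecast frontier model capabilities in terms of human task length without running new human studies.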

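Proposed Method

Fit a 2PL IRT model to binary model–task outcomes to estimate, across benchmarks, the task discrimination a_i, the task difficulty b_i, and the model capability θ_j.
Anchor the latent difficulty scale to human time by regressing log(h_k) on b_k over the tasks k that have human annotations, which yields a log-linear mapping.
Use the calibrated mapping to predict human completion time for tasks that lack annotations.
Forecast the capability frontier over time by taking the best model capability in each release window and converting it to a predicted human task length through the log-linear mapping.
Evaluate agreement with human time annotations and compare BRIDGE against baselines (logit of success rate, LLM-based prediction).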
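
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the standard 2PL form P = 1 / (1 + exp(-a(θ - b))) is assumed, the intercept and slope names `alpha` and `beta` are introduced here for illustration, the log is taken as natural, and the anchor data are made up. The IRT fitting step itself is omitted; the sketch starts from difficulty values that are assumed to be already estimated.

```python
import math

def p_success(theta, a, b):
    """Standard 2PL IRT curve: probability that a model with ability
    theta solves a task with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fit_log_linear(b_values, human_minutes):
    """Ordinary least squares fit of log(h_k) = alpha + beta * b_k
    over the tasks that have human time annotations."""
    n = len(b_values)
    ys = [math.log(h) for h in human_minutes]
    mean_b = sum(b_values) / n
    mean_y = sum(ys) / n
    sxx = sum((b - mean_b) ** 2 for b in b_values)
    sxy = sum((b - mean_b) * (y - mean_y) for b, y in zip(b_values, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_b
    return alpha, beta

def predict_minutes(b, alpha, beta):
    """Map a latent difficulty to a predicted human completion time."""
    return math.exp(alpha + beta * b)

# Hypothetical anchor tasks (NOT data from the paper): difficulties that
# are assumed to come from a fitted 2PL model, with annotated human times.
anchor_b = [-2.0, -1.0, 0.0, 1.0, 2.0]
anchor_minutes = [1.0, 2.0, 4.0, 8.0, 16.0]
alpha, beta = fit_log_linear(anchor_b, anchor_minutes)

# Predicted human time for an unannotated task with difficulty b = 3.
unannotated_minutes = predict_minutes(3.0, alpha, beta)

# Under 2PL, success probability is exactly 0.5 when theta == b, so a
# model's 50% horizon is the predicted time at difficulty b = theta.
horizon_minutes = predict_minutes(1.5, alpha, beta)
```

Because the 2PL curve crosses 0.5 exactly at θ = b regardless of the discrimination a, the 50% horizon of a model depends only on its ability and the calibrated mapping.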

Figure 1: Overview of BRIDGE. Model responses across different benchmarks (clustered by colors) are used to fit a two-parameter logistic Item Response Theory (2PL IRT) model, estimating latent task difficulty and model capability. Calibrating latent difficulty against tasks with known human task co

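Experimental Results

Research Questions

RQ1: Does the latent task difficulty inferred by IRT align with human task completion time across benchmarks?
RQ2: Can human task length for new benchmarks be predicted from model performance alone, without new human studies?
RQ3: How does the frontier task length predicted by BRIDGE evolve with model release date?
RQ4: Do BRIDGE's predictions agree with real human annotations and with qualitative expectations across diverse benchmarks?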

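Key Findings

Latent task difficulty b_i correlates with log(human time) at R^2 = 0.81, so IRT difficulty estimates can be converted into time estimates.
The forecast places the frontier models' 50%-success task horizon at roughly 1.4–2.5 hours of human time, doubling approximately every 6 months.
BRIDGE's predictions closely match human times on benchmarks such as SWE-bench Verified and Cybench, outperforming the log-odds-based and LLM-based baselines.
Without additional annotations, the predicted task time scale generalizes to out-of-distribution benchmarks including SWE-bench Verified, MLE-bench, GDPval, and Cybench.
Model performance data alone reproduces the exponential growth in solvable task length across model releases, supporting the METR trend.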
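
To make the doubling claim concrete, the following sketch estimates a doubling time from (release month, 50% horizon) pairs and extrapolates forward. It is an arithmetic illustration only: the function names and the sample points are hypothetical, and none of the numbers are results from the paper beyond the roughly 6-month doubling period it reports.

```python
import math

def doubling_time_months(release_months, horizon_minutes):
    """Least-squares slope of log2(horizon) against release month.
    The doubling time is the reciprocal of that slope."""
    xs = list(release_months)
    ys = [math.log2(h) for h in horizon_minutes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxx / sxy

def extrapolate_horizon(current_horizon, months_ahead, doubling_months):
    """Exponential growth: the horizon doubles every doubling_months."""
    return current_horizon * 2.0 ** (months_ahead / doubling_months)

# Hypothetical frontier points (NOT data from the paper), chosen so the
# horizon doubles every 6 months.
months = [0.0, 6.0, 12.0, 18.0]
horizons = [15.0, 30.0, 60.0, 120.0]
estimated_doubling = doubling_time_months(months, horizons)

# With a 2-hour horizon today and a 6-month doubling time, the same
# trend would imply an 8-hour horizon 12 months later.
hours_in_a_year = extrapolate_horizon(2.0, 12.0, 6.0)
```

Fitting in log2 space keeps the slope directly interpretable as doublings per month, which is why its reciprocal is the doubling time.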

Figure 2: Task length (human completion time) vs. latent task difficulty ($b$) estimated via 2PL IRT across METR task suites (SWAA, HCAST, RE-bench), based on Equation 3. The log-linear fit ($R^{2}=0.81$) shows that each unit increase in $b$ corresponds to $\sim 2.26\times$ longer human comp
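
The $2.26\times$ figure can be read as a regression slope. The caption's Equation 3 is not reproduced in this review, so the following is a sketch that assumes a natural-log linear fit and uses illustrative symbols $\alpha$ (intercept) and $\beta$ (slope):

```latex
\log h = \alpha + \beta\, b
\quad\Longrightarrow\quad
\frac{h(b+1)}{h(b)} = e^{\beta} \approx 2.26
\quad\Longrightarrow\quad
\beta \approx \ln 2.26 \approx 0.815
```

Under that assumption, a task two units harder on the latent scale would take about $2.26^{2} \approx 5.1$ times as long for a human.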


This review was generated by AI and checked by a human editor.