QUICK REVIEW

[Paper Review] BRIDGE: Predicting Human Task Completion Time From Model Performance

Fengyuan Liu, Jay Gala|arXiv (Cornell University)|Feb 6, 2026

Ethics and Social Impacts of AI|Citations: 0

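One-Line Summary

BRIDGE uses a 2PL IRT model to align model performance with human task completion time, which lets it predict human task durations for new benchmarks and forecast frontier model capabilities without additional human annotation.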

ABSTRACT

Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR's exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.

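Motivation and Objectives

Bridge the gap between benchmark scores and human-interpretable task difficulty by anchoring the latent difficulty learned from model responses to human completion time.
Jointly estimate task difficulty and model capability across multiple benchmarks using a two-parameter logistic (2PL) IRT model.
Enable prediction of human task completion time for new benchmarks from model performance data alone.
Forecast frontier model capabilities in terms of human task length without running new human studies.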

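Proposed Method

Fit a 2PL IRT model to binary model-task outcomes, estimating task discrimination a_i, task difficulty b_i, and model capability θ_j across all benchmarks.
For tasks with human annotations, regress log(h_k) on b_k, where h_k is the human completion time of task k; this anchors the latent difficulty scale to human time and establishes a log-linear mapping.
For tasks without annotations, predict human completion time using the calibrated mapping.
Forecast model capability horizons by mapping the best model capability in each release window to a predicted human task length through the log-linear mapping.
Evaluate agreement with human time annotations and compare BRIDGE against baselines (logit of success rate, LLM prediction).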
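
The anchoring and forecasting steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the IRT difficulties b have already been estimated, the function names (`p_correct`, `fit_anchor`, `predict_minutes`) are hypothetical, and the data are synthetic values chosen so that human time exactly doubles per unit of b. It relies on one property of the 2PL model: success probability is exactly 50% when θ equals b, so a model's 50% horizon is the predicted human time of a task whose difficulty equals its capability.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL IRT response function: probability that a model with capability
    theta solves a task with discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fit_anchor(b_annotated, human_minutes):
    """Least-squares fit of log(h_k) = slope * b_k + intercept
    on tasks that have human time annotations."""
    slope, intercept = np.polyfit(b_annotated, np.log(human_minutes), deg=1)
    return slope, intercept

def predict_minutes(b, slope, intercept):
    """Map latent difficulty back to human completion time (minutes)."""
    return np.exp(slope * np.asarray(b, dtype=float) + intercept)

# Synthetic, hypothetical calibration data (not from the paper):
# human time is 10 minutes at b = 0 and doubles with each unit of b.
b_annotated = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
human_minutes = 10.0 * 2.0 ** b_annotated

slope, intercept = fit_anchor(b_annotated, human_minutes)

# Predict human time for an unannotated task with difficulty b = 3.
predicted = predict_minutes([3.0], slope, intercept)

# 50% horizon of a hypothetical model with capability theta = 1.5:
# in 2PL, p = 0.5 exactly when b = theta, independent of a.
theta_best = 1.5
horizon_minutes = predict_minutes(theta_best, slope, intercept)

print(f"slope={slope:.3f}, predicted={predicted[0]:.1f} min, "
      f"horizon={float(horizon_minutes):.1f} min")
```

In a real pipeline the difficulties would come from a joint 2PL fit over all model-task outcomes, and the regression would carry noise that this synthetic example omits.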

Figure 1 : Overview of BRIDGE. Model responses across different benchmarks (clustered by colors) are used to fit a two-parameter logistic Item Response Theory (2PL IRT) model, estimating latent task difficulty and model capability. Calibrating latent difficulty against tasks with known human task co

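Experimental Results

Research Questions

RQ1: Does IRT-estimated latent task difficulty align with human task completion time across benchmarks?
RQ2: Can human task time for a new benchmark be predicted from model performance alone, without new human studies?
RQ3: How does the frontier task length predicted by BRIDGE change with model release date?
RQ4: Do BRIDGE's predictions agree with ground-truth annotations and qualitative expectations across diverse benchmarks?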

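Key Findings

Latent task difficulty b_i correlates with log(human time) with R^2 = 0.81, so completion time can be estimated from IRT difficulty.
Forecast frontier models reach a 50%-success task horizon of roughly 1.4–2.5 hours, doubling approximately every 6 months.
BRIDGE's predictions closely match human times on SWE-bench Verified and Cybench, outperforming the logit-based and LLM-based baselines.
Without additional annotations, the task-time predictions generalize to out-of-distribution benchmarks such as SWE-bench Verified, MLE-bench, GDPval, and Cybench.
The exponential growth of solvable task length across model releases is reproduced from model performance data alone, corroborating the METR trend.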
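
To make the doubling claim concrete, the sketch below extrapolates a task horizon under a fixed 6-month doubling time. It is purely illustrative arithmetic: the function name `horizon_hours` is hypothetical, treating 1.4 and 2.5 hours as starting points is an assumption made for the example, and nothing here is a forecast taken from the paper.

```python
def horizon_hours(months_ahead, current_hours, doubling_months=6.0):
    """Exponential extrapolation: the horizon doubles every doubling_months."""
    return current_hours * 2.0 ** (months_ahead / doubling_months)

# Assumed starting horizons (hours) at the ends of the reported range.
for start in (1.4, 2.5):
    for months in (6, 12, 24):
        hours = horizon_hours(months, start)
        print(f"start={start} h, +{months} months -> {hours:.1f} h")
```

Under these assumptions a horizon grows 4x after one year and 16x after two, which is why small differences in the estimated doubling time compound quickly.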

Figure 2: Task length (human completion time) vs. latent task difficulty ($b$) estimated via 2PL IRT across METR task suites (SWAA, HCAST, RE-bench), based on Equation 3. The log-linear fit ($R^{2}=0.81$) shows that each unit increase in $b$ corresponds to $\sim 2.26\times$ longer human comp
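
The multiplicative factor in the caption can be converted into a regression slope and back. The short sketch below assumes the fit uses natural logarithms (the log base is not stated in this excerpt), and the names `FACTOR_PER_UNIT_B` and `time_ratio` are hypothetical.

```python
import math

FACTOR_PER_UNIT_B = 2.26  # from the Figure 2 caption: ~2.26x per unit of b

# Equivalent slope of log(human time) vs. b, assuming natural logs.
slope_ln = math.log(FACTOR_PER_UNIT_B)

def time_ratio(delta_b):
    """How many times longer a task takes when difficulty rises by delta_b."""
    return FACTOR_PER_UNIT_B ** delta_b

print(f"slope (natural log) = {slope_ln:.3f}")
print(f"ratio for +2 units of b = {time_ratio(2):.2f}x")
```

For example, two tasks that differ by 2 units of latent difficulty would differ by about 5.1x in predicted human time under this fit.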

This review was generated by AI and checked by a human editor.