QUICK REVIEW

[Paper Review] BRIDGE: Predicting Human Task Completion Time From Model Performance

Fengyuan Liu, Jay Gala|arXiv (Cornell University)|2026. 02. 06.

Ethics and Social Impacts of AI | Citations: 0

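One-Line Summary

BRIDGE uses a 2PL IRT model to align model performance with human task completion time, enabling prediction of human task duration for new benchmarks and forecasting of frontier model capabilities without new human annotations.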

ABSTRACT

Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR's exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.

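Motivation and Objectives

Bridge the gap between benchmark scores and human-interpretable task difficulty by anchoring the latent difficulty learned from model responses to human completion time.
Jointly estimate task difficulty and model capability with a two-parameter logistic IRT model across multiple benchmarks.
Enable prediction of human task completion time for new benchmarks from model performance data alone.
Forecast frontier model capabilities in terms of human task time without running new human studies.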

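Proposed Method

Fit a 2PL IRT model to binary model-task outcomes across benchmarks to estimate task discrimination a_i, task difficulty b_i, and model capability θ_j.
For tasks with human annotations, regress log(h_k) on b_k to anchor the latent difficulty axis to human time and establish a log-linear mapping.
Use the calibrated mapping to predict human completion time for unannotated tasks.
Map the capability of the top model in each release window to a predicted human task length via the log-linear mapping, forecasting the model capability horizon.
Evaluate agreement with human time annotations and compare BRIDGE against baselines (logit success rate, LLM prediction).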
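
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the intercept `alpha`, the slope `beta`, and the anchor data are all hypothetical, the logarithm is assumed to be natural, and the IRT parameters are taken as already estimated. It relies on one property of the 2PL model: the success probability is exactly 50% when a model's capability θ equals a task's difficulty b, so the 50% horizon is the mapped human time at b = θ.

```python
import math

def p_correct_2pl(theta, a, b):
    """2PL IRT: probability that a model with capability theta solves a task
    with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fit_log_linear(difficulties, hours):
    """Ordinary least squares for log(h) = alpha + beta * b on annotated tasks."""
    ys = [math.log(h) for h in hours]
    n = len(difficulties)
    mean_b = sum(difficulties) / n
    mean_y = sum(ys) / n
    sxx = sum((b - mean_b) ** 2 for b in difficulties)
    sxy = sum((b - mean_b) * (y - mean_y) for b, y in zip(difficulties, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_b
    return alpha, beta

def predict_hours(b, alpha, beta):
    """Map a latent difficulty (or a model capability) to human hours."""
    return math.exp(alpha + beta * b)

# Synthetic anchor tasks, invented for illustration (not data from the paper).
b_anchor = [-2.0, -1.0, 0.0, 1.0, 2.0]
hours_anchor = [0.5 * math.exp(0.8 * b) for b in b_anchor]

alpha, beta = fit_log_linear(b_anchor, hours_anchor)

# Predicted human time for an unannotated task with estimated difficulty 0.5.
unannotated_hours = predict_hours(0.5, alpha, beta)

# 50% horizon of a hypothetical frontier model with capability theta = 1.5.
theta_frontier = 1.5
horizon_hours = predict_hours(theta_frontier, alpha, beta)
```

In this sketch the same mapping serves both purposes: applied to a task's difficulty it yields a predicted completion time, and applied to a model's capability it yields that model's 50% task horizon.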

Figure 1 : Overview of BRIDGE. Model responses across different benchmarks (clustered by colors) are used to fit a two-parameter logistic Item Response Theory (2PL IRT) model, estimating latent task difficulty and model capability. Calibrating latent difficulty against tasks with known human task co

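Experimental Results

Research Questions

RQ1: Does IRT-estimated latent task difficulty align with human task completion time across benchmarks?
RQ2: Can human task duration for new benchmarks be predicted from model performance alone, without new human studies?
RQ3: How does BRIDGE's predicted frontier task length change with model release date?
RQ4: How well do BRIDGE predictions agree with actual human annotations and qualitative expectations across diverse benchmarks?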

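Key Results

Latent task difficulty b_i correlates with log(human time) with R^2 = 0.81, so human time can be estimated from IRT difficulty.
Forecasts indicate that frontier models reach solvable tasks of roughly 1.4–2.5 hours at 50% success, with the horizon doubling approximately every 6 months.
BRIDGE predictions closely match human times on SWE-bench Verified and Cybench and outperform the logit-based and LLM-based baselines.
The predicted task-time horizon generalizes to out-of-distribution benchmarks such as SWE-bench Verified, MLE-bench, GDPval, and Cybench without additional annotations.
The exponential growth of solvable task length across model releases is reproduced from model performance data alone, supporting the METR trend.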
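
To make the doubling claim concrete, the following sketch extrapolates a 50% task horizon under a fixed 6-month doubling time. It is a back-of-the-envelope illustration, not the paper's forecasting procedure: the function names are hypothetical, and the 2.0-hour starting horizon is an assumed value picked from inside the reported 1.4–2.5 hour range.

```python
import math

def horizon_after(months, start_hours, doubling_months=6.0):
    """Horizon after `months`, assuming it doubles every `doubling_months`."""
    return start_hours * 2.0 ** (months / doubling_months)

def months_to_reach(target_hours, start_hours, doubling_months=6.0):
    """Months until the horizon reaches `target_hours` under the same assumption."""
    return doubling_months * math.log2(target_hours / start_hours)

start = 2.0  # assumed starting horizon in hours (illustrative only)
one_year_later = horizon_after(12, start)        # two doublings
to_work_day = months_to_reach(8.0, start)        # months until an 8-hour horizon
```

Under these assumptions, two doublings take the horizon from 2 hours to 8 hours in 12 months. The extrapolation is only as reliable as the assumption that the doubling time stays constant.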

Figure 2 : Task length (human completion time) vs. latent task difficulty ($b$) estimated via 2PL IRT across METR task suites (SWAA, HCAST, RE-bench), based on Equation 3. The log-linear fit ($R^{2}=0.81$) shows that each unit increase in $b$ corresponds to $\sim 2.26\times$ longer human comp
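
The $\sim 2.26\times$ factor in the caption can be tied back to the slope of the log-linear fit. The short derivation below is a sketch that assumes the fit uses the natural logarithm and writes it as $\log h = \alpha + \beta b$; the symbols $\alpha$ and $\beta$ are introduced here for illustration and may not match the paper's notation.

$$
\log h(b) = \alpha + \beta b
\quad\Longrightarrow\quad
\frac{h(b+1)}{h(b)} = e^{\beta} \approx 2.26
\quad\Longrightarrow\quad
\beta \approx \ln 2.26 \approx 0.815
$$

Under that reading, a task that is two units harder on the latent scale would take about $2.26^{2} \approx 5.1$ times longer for a human to complete.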


This review was generated by AI and reviewed by a human editor.