QUICK REVIEW

[Paper Review] Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill, Alexander G. Shaw | arXiv (Cornell University) | Jan 17, 2026

Explainable Artificial Intelligence (XAI) | Citations: 0

One-line summary

Terminal-Bench 2.0 provides a dataset and harness for evaluating hard, real-world terminal tasks. Frontier models score below 65% on average, and open-weight models reach about 36%.

ABSTRACT

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.

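Motivation and objectives

Motivate the need for a long-horizon, terminal-based benchmark that reflects professional IT work.
Create a diverse, human-verified dataset of difficult terminal tasks with executable verification.
Provide a reproducible framework and evaluation harness for benchmarking frontier models and agents.
Analyze failure modes to guide future improvements to models and agents.
Offer insight into the cost, efficiency, and time horizon of automated terminal work.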

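Proposed method

Define each task as an instruction, a Docker image, tests, and a hand-written oracle solution, subject to a time limit.
Crowd-source 229 tasks and select 89 of them for Terminal-Bench 2.0 based on difficulty and quality ratings.
Run a rigorous, multi-round human audit to ensure specificity, solvability, and completeness.
Standardize evaluation using Harbor and the neutral Terminus 2 scaffold (headless terminal, Bash-based).
Evaluate 16 frontier models with 6 agents, running at least 5 trials per model/agent pair (32,155 trials in total).
Report empirical difficulty and an error taxonomy in detail to diagnose failures.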
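
The task definition above can be pictured as a small record type. The following is a minimal sketch, not the benchmark's actual schema: the class name `TerminalTask`, all field names, the example values, and the rule that a trial counts as resolved only when every test passes are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TerminalTask:
    """Hypothetical container for the task components plus a time limit."""
    instruction: str      # natural-language task description given to the agent
    dockerfile: str       # defines the task's environment
    test_script: str      # tests used to verify the final container state
    oracle_solution: str  # human-written reference solution
    max_seconds: int      # time limit (field name and unit are assumptions)


def is_resolved(test_outcomes: dict[str, bool]) -> bool:
    """Assumed rule: resolved only if at least one test ran and all passed."""
    return bool(test_outcomes) and all(test_outcomes.values())


# Illustrative values only; none of these come from the benchmark.
task = TerminalTask(
    instruction="Build the project and make the test suite pass.",
    dockerfile="FROM ubuntu:24.04\n",
    test_script="pytest -q tests/",
    oracle_solution="make && make test",
    max_seconds=1800,
)
```

Treating an empty test result as unresolved is a conservative choice in this sketch: a trial that produced no test outcomes should not be counted as a success.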

Figure 1: Task resolution rate per model on Terminal-Bench 2.0. The error bars correspond to a 95% confidence interval. The agent scaffold used to report each model was chosen to maximize performance. Results for all agents and models evaluated are in Appendix A.
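
A resolution rate with a 95% confidence interval, as plotted in Figure 1, can be approximated as follows. This is a rough sketch: the review does not say how the paper computes its intervals, so the normal-approximation (Wald) interval used here is an assumption, as are the function name `resolution_rate_ci` and the example counts.

```python
import math


def resolution_rate_ci(resolved: int, trials: int, z: float = 1.96):
    """Return (rate, lower, upper) using a normal-approximation interval.

    Assumption: trials are treated as independent, which ignores the fact
    that repeated trials of the same task are correlated.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= resolved <= trials:
        raise ValueError("resolved must be between 0 and trials")
    rate = resolved / trials
    half_width = z * math.sqrt(rate * (1.0 - rate) / trials)
    # Clamp to [0, 1] because the approximation can overshoot near the edges.
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)


# Hypothetical counts chosen only to illustrate the calculation.
rate, lower, upper = resolution_rate_ci(resolved=280, trials=445)
print(f"{rate:.1%} (95% CI {lower:.1%} to {upper:.1%})")
```

Because trials of the same task are not independent, an interval that resamples whole tasks (for example, a bootstrap over tasks) would likely be wider than this one.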

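Experimental results

Research questions

RQ1: How capable are frontier LLMs and agents at solving long-horizon, real-world terminal tasks?
RQ2: What are the dominant failure modes (execution, coherence, verification) across models?
RQ3: How do model choice and agent scaffolding affect task completion?
RQ4: How well do human-predicted difficulty labels agree with empirical model difficulty?
RQ5: What cost and resources does solving Terminal-Bench tasks require across models?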

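Key findings

Frontier models and agents resolve fewer than 65% of Terminal-Bench 2.0 tasks; smaller models resolve about 15%.
Codex CLI with GPT-5.2 achieves the highest average resolution rate, 63%.
Terminus 2 achieves 58% with Claude Opus 4.5 and 57% with Gemini 3 Pro.
Open-weight models (Kimi K2 Thinking with Terminus 2) reach about 36% on average.
Model choice often affects performance more than the agent scaffold when optimizing for task completion.
Costs range from $1 to $100; most trials finish within 20 minutes, though a few tasks take 2 hours.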

Figure 2: A Terminal-Bench task is composed of an instruction, a Dockerfile, a set of tests, and an oracle solution. Agents run inside a container into which the tests are copied and executed.

This review was generated by AI and checked by a human editor.