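[Paper Digest] Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Terminal-Bench 2.0 introduces a hard, real-world dataset of terminal tasks (89 tasks) and a reproducible evaluation framework; frontier models average below 65%, and open-weight models about 36%.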
AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.
Motivation and Objectives
- Motivate the need for terminal-based, long-horizon benchmarks that reflect professional IT work.
- Create a diverse, human-verified dataset of hard terminal tasks with executable verification.
- Provide a reproducible framework and evaluation harness to benchmark frontier models and agents.
- Analyze failure modes to guide future model and agent improvements.
- Offer insights into cost, efficiency, and time horizons of automated terminal work.
Proposed Method
- Define each task as an instruction, a Docker image, tests, and a hand-written oracle solution within a time limit.
- Crowd-source 229 tasks and select 89 for Terminal-Bench 2.0 based on difficulty and quality reviews.
- Implement a rigorous, multi-round human auditing process to ensure specificity, solvability, and integrity.
- Use Harbor and a neutral Terminus 2 scaffold (headless terminal, Bash-based) to standardize evaluations across agents.
- Evaluate 16 frontier models across 6 agents and run at least five trials per model/agent pair (32,155 trials total).
- Report results with empirical difficulty and a detailed error taxonomy to diagnose failures.
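
The task format and trial aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `Task` and `Trial` classes, their field names, and the `resolution_rates` helper are all hypothetical, and averaging per-task pass rates is an assumption about how a resolution rate might be computed. The sample data is invented purely to exercise the code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical container for the task components listed above."""
    task_id: str
    instruction: str       # natural-language task description
    docker_image: str      # environment the agent works in
    test_command: str      # command that runs the verification tests
    oracle_solution: str   # hand-written reference solution
    time_limit_sec: int


@dataclass(frozen=True)
class Trial:
    """One attempt by an (agent, model) pair on one task."""
    agent: str
    model: str
    task_id: str
    passed: bool  # True only if the verification tests all passed


def resolution_rates(trials):
    """Average the per-task pass rates for each (agent, model) pair.

    Assumption: every task is weighted equally, regardless of how many
    trials were run on it.
    """
    per_task = defaultdict(list)
    for t in trials:
        per_task[(t.agent, t.model, t.task_id)].append(t.passed)

    per_pair = defaultdict(list)
    for (agent, model, _task_id), outcomes in per_task.items():
        per_pair[(agent, model)].append(sum(outcomes) / len(outcomes))

    return {pair: sum(r) / len(r) for pair, r in per_pair.items()}


# Invented example: five trials on each of two placeholder tasks.
demo_trials = (
    [Trial("agent-a", "model-x", "task-1", p)
     for p in (True, True, False, True, True)]
    + [Trial("agent-a", "model-x", "task-2", False) for _ in range(5)]
)
demo_rates = resolution_rates(demo_trials)
print(demo_rates)
```

Here the pair passes four of five trials on the first task and none on the second, so its averaged rate is 0.4. Grouping by task before averaging keeps a task with extra trials from dominating the score; when every task has the same number of trials, the result equals the plain fraction of passed trials.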

Experimental Results
Research Questions
- RQ1: How capable are frontier LLMs and agents at solving long-horizon, real-world terminal tasks?
- RQ2: What are the dominant failure modes (execution, coherence, verification) across models?
- RQ3: How does model choice compare to agent scaffolding in affecting performance on Terminal-Bench 2.0?
- RQ4: To what extent do human-predicted difficulty labels align with empirical model difficulty?
- RQ5: What are the cost and resource implications of solving Terminal-Bench tasks across models?
Key Findings
- Frontier models and agents resolve less than 65% of tasks on Terminal-Bench 2.0, with smaller models around 15%.
- Codex CLI with GPT-5.2 achieves the highest average resolution rate of 63%.
- Terminus 2 with Claude Opus 4.5 and with Gemini 3 Pro achieve 58% and 57%, respectively.
- Open-weight models such as Kimi K2 Thinking, run with the Terminus 2 scaffold, reach about 36% on average.
- Model choice often dominates performance over agent scaffold when optimizing for task completion.
- Costs range from $1 to $100; most attempts finish in under 20 minutes, though some tasks take up to two hours.

This digest was generated by AI and reviewed by a human editor.