QUICK REVIEW

[Paper Review] Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill, Alexander G. Shaw | arXiv (Cornell University) | 2026. 01. 17.

Explainable Artificial Intelligence (XAI) | Citations: 0

One-Line Summary

Terminal-Bench 2.0 introduces a dataset of 89 hard, real-world terminal tasks and a reproducible harness; frontier models score below 65% on average, and open-weight models reach roughly 36%.

ABSTRACT

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.

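Motivation and Objectives

Motivate the need for a terminal-based, long-horizon task benchmark that reflects professional IT work.
Build a diverse, human-verified dataset of hard terminal tasks with executable verification.
Provide a reproducible framework and evaluation harness for benchmarking frontier models and agents.
Analyze failure modes to guide future model and agent improvements.
Offer insight into the cost, efficiency, and time horizon of automated terminal tasks.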

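Proposed Method

Define each task as an instruction, a Docker image, tests, and a hand-written oracle solution, all bounded by a time limit.
Crowdsource 229 tasks and select 89 of them for Terminal-Bench 2.0 based on difficulty and quality review.
Implement a rigorous, multi-stage human review process to ensure specificity, solvability, and integrity.
Use Harbor and the neutral Terminus 2 scaffold (headless terminal, Bash-based) to standardize evaluation across diverse agents.
Evaluate 16 frontier models across 6 agents, with at least five trials per model/agent pair (32,155 trials in total).
Present empirical difficulty and a detailed error taxonomy to diagnose failures.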
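
The task definition described above (instruction, Docker image, tests, oracle solution, time limit) can be pictured as a small record type. The following is a minimal sketch, not the benchmark's actual schema: the class name `TerminalTask`, every field name, and the example values are assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TerminalTask:
    """Hypothetical record for one task; field names are illustrative only."""
    task_id: str
    instruction: str               # natural-language description handed to the agent
    docker_image: str              # image that defines the task environment
    test_paths: tuple[Path, ...]   # tests that verify the final container state
    oracle_solution: Path          # hand-written reference solution
    timeout_sec: float             # time limit for a single attempt

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means all parts are present."""
        problems = []
        if not self.instruction.strip():
            problems.append("instruction is empty")
        if not self.docker_image:
            problems.append("docker_image is missing")
        if not self.test_paths:
            problems.append("no tests provided")
        if self.timeout_sec <= 0:
            problems.append("timeout_sec must be positive")
        return problems


# Invented example values, not a real Terminal-Bench task.
example = TerminalTask(
    task_id="example-task",
    instruction="Make the failing build under /app pass.",
    docker_image="example/image:latest",
    test_paths=(Path("tests/test_outputs.py"),),
    oracle_solution=Path("solution.sh"),
    timeout_sec=1800.0,
)
print(example.validate())
```

Keeping the record immutable (`frozen=True`) reflects the idea that a task definition should not change between trials, which is one way to support reproducibility.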

Figure 1: Task resolution rate per model on Terminal-Bench 2.0. The error bars correspond to a 95% confidence interval. The agent scaffold used to report each model was chosen to maximize performance. Results for all agents and models evaluated are in Appendix A.
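
As a rough illustration of how a resolution rate with a 95% confidence interval could be computed from pass/fail trial outcomes, here is a minimal sketch. This summary does not state which interval method the paper uses; the Wilson score interval below is an assumption chosen for illustration, and the function name `resolution_rate_ci` is hypothetical. It also treats trials as independent, which may not hold when several trials share a task.

```python
import math


def resolution_rate_ci(outcomes, z=1.96):
    """Return (rate, low, high) for a sequence of pass/fail outcomes.

    Uses the Wilson score interval (an assumption, not the paper's stated
    method); z=1.96 corresponds to roughly 95% coverage.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one trial outcome is required")
    p = sum(1 for ok in outcomes if ok) / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half_width, center + half_width


# Invented outcomes: 50 resolved and 50 unresolved trials.
rate, low, high = resolution_rate_ci([True] * 50 + [False] * 50)
print(f"rate={rate:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

The Wilson interval stays inside [0, 1] even for rates near 0% or 100%, which is why it is often preferred over the plain normal approximation for small trial counts.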

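Experimental Results

Research Questions

RQ1: How well can frontier LLMs and agents solve long-horizon, real-world terminal tasks?
RQ2: What are the dominant failure modes (execution, coherence, verification) across models?
RQ3: How do model choice and agent scaffolding compare as factors affecting performance on Terminal-Bench 2.0?
RQ4: How well do human-predicted difficulty labels agree with empirical model difficulty?
RQ5: What are the cost and resource implications of solving Terminal-Bench tasks across models?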

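Key Results

Frontier models and agents solve fewer than 65% of tasks on Terminal-Bench 2.0, while smaller models solve about 15%.
Codex CLI with GPT-5.2 achieves the highest average resolution rate, 63%.
Terminus 2 with Claude Opus 4.5 and Terminus 2 with Gemini 3 Pro reach 58% and 57%, respectively.
Terminus 2 with the open-weight model Kimi K2 Thinking reaches about 36% on average.
Model choice often affects performance more than agent scaffolding when optimizing for task completion.
Costs range from $1 to $100; most attempts finish within 20 minutes, and some tasks take up to two hours.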

Figure 2: A Terminal-Bench task is composed of an instruction, a Dockerfile, a set of tests, and an oracle solution. Agents run inside a container into which the tests are copied and executed.
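
The verification step in Figure 2 (tests copied into the container, then executed) can be sketched with standard Docker CLI commands. This is only an illustration under stated assumptions, not the Harbor harness: the function name, the `/tests` destination, and the `run-tests.sh` entrypoint are hypothetical. The sketch only builds the command lines and does not run them.

```python
def build_verification_commands(container, tests_dir,
                                test_entrypoint="run-tests.sh",
                                dest="/tests"):
    """Build argv lists that copy tests into a container and then run them.

    `test_entrypoint` and `dest` are hypothetical defaults for illustration.
    """
    # A trailing "/." makes `docker cp` copy the directory's contents into
    # dest whether or not dest already exists in the container.
    source = tests_dir.rstrip("/") + "/."
    copy_cmd = ["docker", "cp", source, f"{container}:{dest}"]
    run_cmd = ["docker", "exec", container, "bash", f"{dest}/{test_entrypoint}"]
    return [copy_cmd, run_cmd]


for cmd in build_verification_commands("agent-container", "tests/"):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)`. Copying the tests in only after the agent has finished is one plausible way to keep the agent from reading them, though this summary does not say when the copy happens.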


This review was generated by AI and reviewed by a human editor.