QUICK REVIEW

[Paper Review] General Agent Evaluation

Elron Bandel, Asaf Yehudai|arXiv (Cornell University)|Feb 26, 2026

Multi-Agent Systems and Negotiation | Cited by 2

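One-Sentence Summary

This paper introduces Exgentic and a Unified Protocol for evaluating general agents across diverse benchmarks, and presents the Open General Agent Leaderboard, which shows that model quality is the dominant factor in cross-task performance.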

ABSTRACT

The promise of general-purpose agents - systems that perform tasks in unfamiliar environments without domain-specific engineering - remains largely unrealized. Existing agents are predominantly specialized, and while emerging implementations like OpenAI SDK Agent and Claude Code hint at broader capabilities, no systematic evaluation of their general performance has been pursued. Current agentic benchmarks assume domain-specific integration, encoding task information in ways that preclude fair evaluation of general agents. This paper frames general-agent evaluation as a first-class research objective. We propose conceptual principles for such evaluation, a Unified Protocol enabling agent-benchmark integration, and Exgentic - a practical framework for general agent evaluation. We benchmark five prominent agent implementations across six environments as the first Open General Agent Leaderboard. Our experiments show that general agents generalize across diverse environments, achieving performance comparable to domain-specific agents without any environment-specific tuning. We release our evaluation protocol, framework, and leaderboard to establish a foundation for systematic research on general-purpose agents.

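Motivation and Goals

Propose a conceptual and practical framework for evaluating general-purpose AI agents on heterogeneous benchmarks.
Decouple benchmark semantics from agent implementations through a unified mediation protocol.
Provide an extensible evaluation tool, Exgentic, and the Open General Agent Leaderboard to enable systematic comparison.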

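Proposed Method

Define the Unified Protocol as a mediation layer with Task, Context, and Actions fields, so that agents are decoupled from benchmark-specific details.
Build Exgentic as an orchestration framework whose adapters translate between agent APIs and benchmark protocols.
Benchmark five agent architectures across six environments with three frontier LLMs to construct the Open General Agent Leaderboard.
Analyze variance to separate the contributions of model quality, agent architecture, and task difficulty to performance.
Evaluate cost-performance tradeoffs and the contribution of individual components (memory, planning, tool shortlisting) across configurations.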
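The mediation idea in the first two steps can be illustrated with a small sketch. This is not Exgentic's actual API: the class names `UnifiedTask` and `Action`, the helper functions `to_tool_specs` and `dispatch`, and the toy retail environment are all hypothetical, chosen only to mirror the Task, Context, and Actions fields named above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Action:
    """One operation a benchmark exposes to the agent (hypothetical shape)."""
    name: str
    description: str
    handler: Callable[..., Any]


@dataclass
class UnifiedTask:
    """Hypothetical container mirroring the Task / Context / Actions fields."""
    task: str
    context: Dict[str, Any] = field(default_factory=dict)
    actions: List[Action] = field(default_factory=list)


def to_tool_specs(unified: UnifiedTask) -> List[Dict[str, str]]:
    """Adapter step: describe benchmark actions as generic tools for an agent."""
    return [{"name": a.name, "description": a.description} for a in unified.actions]


def dispatch(unified: UnifiedTask, name: str, **kwargs: Any) -> Any:
    """Route an agent's tool call back to the benchmark-side handler."""
    for action in unified.actions:
        if action.name == name:
            return action.handler(**kwargs)
    raise KeyError(f"unknown action: {name}")


# Toy benchmark (an assumption for illustration): one lookup action.
orders = {"A17": "shipped"}
example = UnifiedTask(
    task="Tell the customer the status of order A17.",
    context={"domain": "retail"},
    actions=[
        Action(
            name="get_order_status",
            description="Look up an order by id.",
            handler=lambda order_id: orders[order_id],
        )
    ],
)

print(to_tool_specs(example))
print(dispatch(example, "get_order_status", order_id="A17"))
```

In this sketch the agent only ever sees the task text, the context dictionary, and the tool descriptions, while the benchmark keeps ownership of the handlers. That separation is the point of a mediation layer, whatever concrete shape the real protocol takes.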

Figure 1: Cost-performance tradeoffs across agent-model configurations. The Pareto frontier (red dashed line) shows the optimal tradeoffs: GPT 5.2 configurations offer the best cost-efficiency, while Claude Opus 4.5 configurations achieve the highest performance at 3-33 $\times$ higher cost.
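A Pareto frontier of the kind described in the Figure 1 caption can be computed from the Avg Cost and Avg Success columns of the results table in this review. The sketch below is an assumption-laden illustration: the function name `pareto_frontier` and the strict-improvement rule are choices made here, and a frontier derived from the rounded table values need not match the one drawn in the paper's figure.

```python
from typing import List, Tuple

Config = Tuple[str, str, float, float]  # (agent, model, avg_cost_usd, avg_success)

# Values copied from the Avg Cost and Avg Success columns of the results table.
CONFIGS: List[Config] = [
    ("OpenAI Solo", "Claude Opus 4.5", 8.5, 0.73),
    ("Claude Code", "Claude Opus 4.5", 8.0, 0.67),
    ("Smolagent", "Claude Opus 4.5", 4.4, 0.66),
    ("ReAct Short", "Gemini 3", 0.7, 0.62),
    ("ReAct Short", "Claude Opus 4.5", 3.8, 0.62),
    ("ReAct", "Gemini 3", 0.8, 0.61),
    ("ReAct", "Claude Opus 4.5", 5.8, 0.61),
    ("OpenAI Solo", "Gemini 3", 2.8, 0.60),
    ("Claude Code", "Gemini 3", 2.5, 0.57),
    ("Smolagent", "Gemini 3", 1.8, 0.56),
    ("ReAct Short", "GPT 5.2", 0.3, 0.46),
    ("ReAct", "GPT 5.2", 0.2, 0.41),
    ("OpenAI Solo", "GPT 5.2", 0.2, 0.39),
    ("Claude Code", "GPT 5.2", 0.4, 0.38),
    ("Smolagent", "GPT 5.2", 0.4, 0.38),
]


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Keep a config only if it beats every cheaper-or-equal config on success."""
    frontier: List[Config] = []
    best_success = float("-inf")
    # Cheapest first; among equal costs, the most successful first.
    for cfg in sorted(configs, key=lambda c: (c[2], -c[3])):
        if cfg[3] > best_success:
            frontier.append(cfg)
            best_success = cfg[3]
    return frontier


frontier = pareto_frontier(CONFIGS)
for agent, model, cost, success in frontier:
    print(f"{agent:12s} {model:16s} ${cost:>4.1f}  {success:.2f}")
```

Sorting by cost and keeping only strict improvements in success is a standard way to extract a two-objective frontier. Configurations that tie on success at a higher cost are treated as dominated.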

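Experimental Results

Research Questions

RQ1: Can general agents generalize across diverse benchmarks without environment-specific tuning?
RQ2: Which factor dominates general-agent performance: model quality or agent architecture?
RQ3: Which agent components contribute most to cross-domain capability?
RQ4: How do cost-efficiency and stability vary across model-agent configurations?
RQ5: Does a single agent dominate on all benchmarks, or do results depend on the pairing of model and task?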

Key Findings

General Agent	Model	Avg Success	Avg Cost	AppWorld	BrowseComp+	SWE-Bench Verified	Tau2 Airline	Tau2 Retail	Tau2 Telecom
OpenAI Solo	Claude Opus 4.5	.73	$8.5	.68	.61	.81	.74	.85	.84
Claude Code	Claude Opus 4.5	.67	$8.0	.66	.53	.74	.66	.83	.76
Smolagent	Claude Opus 4.5	.66	$4.4	.70	.61	.65	.72	.78	.58
ReAct Short	Gemini 3	.62	$0.7	.55	.48	.71	.70	.82	.73
ReAct Short	Claude Opus 4.5	.62	$3.8	.64	.49	.61	.66	.78	.76
ReAct	Gemini 3	.61	$0.8	.51	.48	.71	.70	.82	.73
ReAct	Claude Opus 4.5	.61	$5.8	.61	.49	.61	.66	.78	.76
OpenAI Solo	Gemini 3	.60	$2.8	.58	.33	.72	.62	.73	.89
Claude Code	Gemini 3	.57	$2.5	.36	.51	.67	.70	.78	.69
Smolagent	Gemini 3	.56	$1.8	.13	.57	.76	.68	.76	.88
ReAct Short	GPT 5.2	.46	$0.3	.22	.46	.57	.54	.73	.54
ReAct	GPT 5.2	.41	$0.2	.00	.46	.57	.54	.73	.54
OpenAI Solo	GPT 5.2	.39	$0.2	.00	.48	.55	.50	.54	.53
Claude Code	GPT 5.2	.38	$0.4	.00	.43	.58	.48	.51	.55
Smolagent	GPT 5.2	.38	$0.4	.07	.26	.53	.60	.68	.71

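Model quality explains most of the performance variance across configurations; agent architecture explains comparatively little.
Claude Opus 4.5 generally achieves the highest average performance, while GPT 5.2 performs worst, owing to failures in tool-rich environments.
Cost-efficiency varies widely across configurations (by up to roughly 33x), driven by model choice and tool usage.
No single agent dominates on all benchmarks; OpenAI Solo and four Claude/OpenAI pairings excel on different tasks, indicating strong model-dependent effects.
Tool shortlisting and architectural guards improve performance and robustness in tool-rich environments.
Cross-benchmark correlations are moderate to strong, indicating that model quality drives the overall trend, while agent rankings vary with the model.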
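The claim that model quality explains most of the variance can be sanity-checked against the Avg Success column of the results table. The sketch below uses a plain two-way main-effects sum-of-squares decomposition. This is an assumption made for illustration and not necessarily the analysis the authors ran; the function name `variance_shares` is likewise hypothetical.

```python
from statistics import mean
from typing import Dict

# Avg Success per (agent, model), copied from the results table.
SUCCESS: Dict[str, Dict[str, float]] = {
    "OpenAI Solo": {"Claude Opus 4.5": 0.73, "Gemini 3": 0.60, "GPT 5.2": 0.39},
    "Claude Code": {"Claude Opus 4.5": 0.67, "Gemini 3": 0.57, "GPT 5.2": 0.38},
    "Smolagent": {"Claude Opus 4.5": 0.66, "Gemini 3": 0.56, "GPT 5.2": 0.38},
    "ReAct Short": {"Claude Opus 4.5": 0.62, "Gemini 3": 0.62, "GPT 5.2": 0.46},
    "ReAct": {"Claude Opus 4.5": 0.61, "Gemini 3": 0.61, "GPT 5.2": 0.41},
}


def variance_shares(table: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Split total sum of squares into model, agent, and residual shares."""
    agents = list(table)
    models = list(next(iter(table.values())))
    grand = mean(table[a][m] for a in agents for m in models)
    ss_total = sum((table[a][m] - grand) ** 2 for a in agents for m in models)
    # Main effect of the model: spread of per-model means around the grand mean.
    ss_model = len(agents) * sum(
        (mean(table[a][m] for a in agents) - grand) ** 2 for m in models
    )
    # Main effect of the agent: spread of per-agent means around the grand mean.
    ss_agent = len(models) * sum(
        (mean(table[a].values()) - grand) ** 2 for a in agents
    )
    return {
        "model": ss_model / ss_total,
        "agent": ss_agent / ss_total,
        "residual": 1.0 - (ss_model + ss_agent) / ss_total,
    }


shares = variance_shares(SUCCESS)
for factor, share in shares.items():
    print(f"{factor:8s} {share:.3f}")
```

On these fifteen rows the model factor accounts for the large majority of the variance and the agent factor for only a few percent, which is consistent with the first finding. The residual term captures model-agent interaction, in line with the observation that agent rankings vary with the model.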

Figure 2 : Evolution of Agentic Evaluation. (A) Collection of separate benchmarks, each requiring a custom agent or an agent with specific adaptation per benchmark (HAL) (B) Multiple benchmarks consolidated through a single protocol, such as CLI, or Web (C) Multiple benchmarks consolidated through a

This review was generated by AI and reviewed by human editors.