QUICK REVIEW

[Paper Review] $τ$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao, Noah Shinn|arXiv (Cornell University)|Jun 17, 2024

Semantic Web and Ontologies|Cited by 5

One-Sentence Summary

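The paper proposes $τ$-bench, a benchmark for evaluating tool-agent-user interaction in real-world domains, in which a language agent equipped with domain-specific API tools and policy guidelines converses with a user simulated by a language model. The excerpt this review was generated from mainly shows an interaction transcript rather than a full experimental report.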

ABSTRACT

Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose $τ$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point to the need for methods that can improve the ability of agents to act consistently and follow rules reliably.
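
The abstract names the pass^k metric without defining it. The following is a minimal sketch, assuming pass^k is the chance that k trials drawn without replacement from the n recorded trials of a task all succeed, averaged over tasks. With c successes among n trials, that chance is C(c, k) / C(n, k). The function names and the sample data are hypothetical and are not taken from the paper's code.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Chance that k of the recorded trials, drawn without replacement, all succeed.

    Assumed estimator: C(c, k) / C(n, k) for c successes in n trials.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    # comb(c, k) is 0 when c < k, so a task with too few successes scores 0.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results: list[list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks.

    `results` holds one list of trial outcomes (True = success) per task.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    scores = [pass_hat_k(len(trials), sum(trials), k) for trials in results]
    return sum(scores) / len(scores)


# Hypothetical outcomes: two tasks, four trials each.
sample_results = [
    [True, True, True, True],    # always solved
    [True, False, True, False],  # solved half of the time
]
for k in (1, 2, 4):
    print(f"pass^{k} = {benchmark_pass_hat_k(sample_results, k):.3f}")
```

Under this definition pass^1 is the ordinary success rate, and the score can only stay flat or fall as k grows, which is why it works as a consistency measure rather than a best-of-k one.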

Research Motivation and Objectives

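Motivate the need to benchmark tool-agent-user interaction on realistic tasks: per the abstract, existing benchmarks test neither interaction with human users nor the ability to follow domain-specific rules.
Define a benchmark for evaluating an agent's decisions when selecting and calling domain-specific API tools.
Evaluate how interaction with the user affects tool selection and task outcomes.
Provide a framework for tracking and improving how consistently agents act and follow rules across domains.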

Proposed Method

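Propose a benchmark framework for tool-agent-user interaction across real-world domains.
Describe the interaction workflow between the simulated user and the agent, including tool selection, exchanges, and user-confirmed actions.
Outline the evaluation criteria: per the abstract, the database state at the end of a conversation is compared with the annotated goal state, and pass^k measures reliability over multiple trials.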
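
The abstract describes an evaluation that compares the database state at the end of a conversation with the annotated goal state. The following is a minimal sketch of that idea, assuming the database can be represented as a nested dictionary of JSON-serializable records. The helper names, the record layout, and the order ID are hypothetical illustrations, not the benchmark's actual schema or code.

```python
import hashlib
import json


def state_digest(db: dict) -> str:
    """Hash a canonical JSON rendering of the database state.

    Sorting keys makes the digest independent of dictionary insertion order.
    """
    canonical = json.dumps(db, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def task_succeeded(final_db: dict, goal_db: dict) -> bool:
    """Count a task as solved only if the final state matches the goal state exactly."""
    return state_digest(final_db) == state_digest(goal_db)


# Hypothetical retail-style records: the goal is a cancelled order.
goal_db = {"orders": {"W001": {"status": "cancelled", "items": ["mug"]}}}
final_ok = {"orders": {"W001": {"items": ["mug"], "status": "cancelled"}}}
final_bad = {"orders": {"W001": {"items": ["mug"], "status": "pending"}}}

print(task_succeeded(final_ok, goal_db))
print(task_succeeded(final_bad, goal_db))
```

Checking the end state rather than the dialogue text means the score does not depend on how the agent worded its replies, only on what it actually changed.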

Experimental Results

Research Questions

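RQ1: How can tool-agent-user interaction be benchmarked effectively on real-world tasks?
RQ2: Which criteria best capture the quality of an agent's tool selection and exchange decisions?
RQ3: In multi-domain settings, how does user input affect the agent's choices and overall task success?
RQ4: Which metrics can robustly measure compatibility and satisfaction across domains?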

Key Findings

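The excerpt behind this review consists mainly of an interaction transcript, so the only quantitative results available here are those stated in the abstract.
Per the abstract, even state-of-the-art function calling agents such as gpt-4o succeed on fewer than 50% of the tasks.
Agents are also inconsistent across repeated trials: pass^8 is below 25% in the retail domain.
The excerpt shows evidence of tool selection and exchange steps, but it contains no formal experimental design or comparative analysis.
Therefore, no numerical findings beyond those in the abstract can be extracted from the given source fragment.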

This review was generated by AI and reviewed by human editors.