QUICK REVIEW

[Paper Review] TravelPlanner: A Benchmark for Real-World Planning with Language Agents

Jian Xie, Kai Zhang|arXiv (Cornell University)|Feb 2, 2024

Multi-Agent Systems and Negotiation|Cited by 13

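One-Sentence Summary

TravelPlanner introduces a real-world travel-planning benchmark with 1,225 annotated queries and a sandbox of nearly four million data records, designed to test language agents on tool use and multi-constraint planning; even GPT-4 achieves a final pass rate of only 0.6%.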

ABSTRACT

Planning has been part of the core pursuit for artificial intelligence since its conception, but earlier AI agents mostly focused on constrained settings because many of the cognitive substrates necessary for human-level planning have been lacking. Recently, language agents powered by large language models (LLMs) have shown interesting capabilities such as tool use and reasoning. Are these language agents capable of planning in more complex settings that are out of the reach of prior AI agents? To advance this investigation, we propose TravelPlanner, a new planning benchmark that focuses on travel planning, a common real-world planning scenario. It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans. Comprehensive evaluations show that the current language agents are not yet capable of handling such complex planning tasks; even GPT-4 only achieves a success rate of 0.6%. Language agents struggle to stay on task, use the right tools to collect information, or keep track of multiple constraints. However, we note that the mere possibility for language agents to tackle such a complex problem is in itself non-trivial progress. TravelPlanner provides a challenging yet meaningful testbed for future language agents.

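Research Motivation and Objectives

Assess whether LLM-powered language agents can carry out complex, multi-constraint travel planning in a realistic sandbox environment.
Evaluate how effectively agents use tools and planning strategies under environment, commonsense, and hard constraints.
Identify the common failure modes of current language agents on long-horizon planning tasks.
Provide a challenging benchmark that pushes stronger language agents toward human-level planning.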

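Proposed Method

Build a static sandbox environment with six data tools and roughly four million travel data records.
Design 1,225 diverse queries with varying trip durations and hard constraints, each paired with a reference plan.
Annotate every query with a human-written reference plan, which guarantees that at least one feasible solution exists.
Evaluate agents with micro and macro metrics covering delivery rate, commonsense constraint pass rate, hard constraint pass rate, and final pass rate.
Compare several LLMs (GPT-4-Turbo, Gemini Pro, Mixtral, and others) and planning strategies (Direct, CoT, ReAct, Reflexion) in two-stage and sole-planning modes.
Analyze failure modes, including tool-use errors, dead loops, and hallucinations, to understand where planning breaks down.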
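
The metrics listed above can be illustrated with a small sketch. It assumes the usual reading of the two aggregation levels: the micro pass rate pools all individual constraints across plans, while the macro pass rate counts a plan only if it passes every constraint in that category. The `PlanResult` structure, the function names, and the sample data are hypothetical and are not taken from the benchmark's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlanResult:
    """Constraint outcomes for one query (hypothetical structure)."""
    delivered: bool          # did the agent hand in a plan at all?
    commonsense: List[bool]  # one entry per commonsense constraint checked
    hard: List[bool]         # one entry per hard constraint checked


def micro_pass_rate(checks: List[List[bool]]) -> float:
    """Passed constraints divided by all constraints, pooled over plans."""
    total = sum(len(c) for c in checks)
    passed = sum(sum(c) for c in checks)
    return passed / total if total else 0.0


def macro_pass_rate(checks: List[List[bool]]) -> float:
    """Fraction of plans that pass every constraint in the category."""
    return sum(all(c) for c in checks) / len(checks) if checks else 0.0


def delivery_rate(results: List[PlanResult]) -> float:
    """Fraction of queries for which a plan was delivered."""
    return sum(r.delivered for r in results) / len(results) if results else 0.0


def final_pass_rate(results: List[PlanResult]) -> float:
    """Fraction of plans that are delivered and pass all constraints."""
    ok = sum(r.delivered and all(r.commonsense) and all(r.hard) for r in results)
    return ok / len(results) if results else 0.0


# Made-up outcomes for four queries; an undelivered plan is recorded here
# as failing every constraint (an assumption of this sketch).
results = [
    PlanResult(True, [True, True, True, True], [True, True]),
    PlanResult(True, [True, True, True, False], [True, False]),
    PlanResult(True, [True, False, True, True], [True, True]),
    PlanResult(False, [False] * 4, [False] * 2),
]

commonsense = [r.commonsense for r in results]
print(delivery_rate(results))        # 0.75
print(micro_pass_rate(commonsense))  # 0.625
print(macro_pass_rate(commonsense))  # 0.25
print(final_pass_rate(results))      # 0.25
```

In this toy data most individual commonsense constraints pass (micro 0.625), yet only one plan in four passes all of them (macro 0.25), which is the kind of gap the micro and macro metrics are meant to expose.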

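Experimental Results

Research Questions

RQ1: Can state-of-the-art language agents produce feasible multi-constraint travel plans when given a full set of information-gathering tools?
RQ2: How do planning strategies such as ReAct and Reflexion perform on complex real-world planning tasks with multiple constraints?
RQ3: Which failure modes (tool-use errors, dead loops, hallucinations) dominate and limit performance on TravelPlanner?
RQ4: How does agent performance differ between the two-stage mode (information gathering plus planning) and the sole-planning mode?
RQ5: How large is the gap between micro and macro constraint pass rates on this task?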

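Key Findings

In two-stage mode, GPT-4-Turbo with ReAct achieves a final pass rate of 0.6% on the test set.
Most other LLMs fail to complete any task on TravelPlanner.
Two-stage mode scores lower than sole-planning mode across metrics, with gaps reaching over 30%.
Agents struggle to satisfy hard constraints and to keep all constraints in view at once, which shows up as low macro pass rates.
Common failure modes include wrong arguments in tool calls, dead loops, and hallucinations, pointing to the need for more sophisticated planning strategies and tool reasoning.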
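
As a rough illustration of the dead-loop failure mode, the sketch below flags a trajectory in which the agent issues the same action several times in a row. The detection rule, the threshold of three repeats, and the action strings are all assumptions made for this example; they do not reproduce the paper's evaluation code or its tool names.

```python
from typing import List, Optional


def find_dead_loop(actions: List[str], max_repeats: int = 3) -> Optional[int]:
    """Return the index at which an action has been repeated
    max_repeats times consecutively, or None if that never happens."""
    run = 1  # length of the current streak of identical actions
    for i in range(1, len(actions)):
        run = run + 1 if actions[i] == actions[i - 1] else 1
        if run >= max_repeats:
            return i
    return None


# Hypothetical trajectory: the agent keeps re-issuing the same hotel search.
trajectory = [
    "search_city(Texas)",
    "search_flights(Seattle, Austin)",
    "search_hotels(Austin)",
    "search_hotels(Austin)",
    "search_hotels(Austin)",
    "search_restaurants(Austin)",
]

print(find_dead_loop(trajectory))      # 4
print(find_dead_loop(trajectory[:3]))  # None
```

A check like this only catches exact repetition; loops that alternate between two or more actions would need a wider window.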


This review was generated by AI and checked by a human editor.