QUICK REVIEW

[Paper Review] TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems

Yilun Kong, Jingqing Ruan|arXiv (Cornell University)|Nov 19, 2023

Topic Modeling|Cited by 7

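One-sentence summary

The paper proposes a three-component framework (API Retriever, LLM Finetuner, Demo Selector) to improve the task planning and API usage of LLM-based agents in real-world systems, and validates it on real commercial data and on ToolBench.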

ABSTRACT

Large Language Models (LLMs) have demonstrated proficiency in addressing tasks that require a blend of task planning and the use of external tools, such as APIs. However, real-world complex systems present three prevalent challenges concerning task planning and tool usage: (1) the real system usually has a vast array of APIs, so it is impossible to feed the descriptions of all APIs into the prompt of LLMs because the token length is limited; (2) the real system is designed for handling complex tasks, and the base LLMs can hardly plan a correct sub-task order and API-calling order for such tasks; (3) similar semantics and functionalities among APIs in real systems make it hard for both LLMs and humans to distinguish between them. In response, this paper introduces a comprehensive framework aimed at enhancing the Task Planning and Tool Usage (TPTU) abilities of LLM-based agents operating within real-world systems. Our framework comprises three key components designed to address these challenges: (1) the API Retriever selects the most pertinent APIs for the user task among the extensive array available; (2) the LLM Finetuner tunes a base LLM so that the finetuned LLM is more capable of task planning and API calling; (3) the Demo Selector adaptively retrieves different demonstrations related to hard-to-distinguish APIs, which are further used for in-context learning to boost the final performance. We validate our methods using a real-world commercial system as well as an open-sourced academic dataset, and the outcomes clearly showcase the efficacy of each individual component as well as the integrated framework.

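Research motivation and goals

Identify the practical challenges LLM-based agents face in real systems (a very large API set, complex task and API-call sequences, and similarity among APIs).
Propose a three-component framework to address these challenges: API Retriever, LLM Finetuner, and Demo Selector.
Demonstrate the effectiveness of each component and of the integrated framework on a real-world dataset and an open-source dataset.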

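Proposed method

The API Retriever uses semantic embeddings, trained with a dual-stream SBERT setup and Multiple Negatives Ranking Loss, to select the most relevant APIs from a large API collection.
The LLM Finetuner performs supervised fine-tuning on carefully constructed datasets to improve task planning and API calling in real-world scenarios.
The Demo Selector dynamically retrieves demonstrations (at the subtask level or the API level) based on embedding similarity, to improve in-context learning and to distinguish similar APIs.
Training data for the API Retriever relies on instruction-API pairs and a mixed human/LLM annotation process.
The fine-tuning datasets comprise Training Set v1 (real-world distribution), Training Set v2 (prompts with a feature list), and Training Set v3 (diverse prompts and multi-step API interactions).
The Demo Selector uses embeddings of the Knowledge Database and the API Collection to fetch the top-k demonstrations, falling back to API-level demonstrations when necessary.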
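
The retrieval and selection steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mnr_loss`, `select_demos`), the toy two-dimensional vectors, the scale of 20, and the similarity threshold that triggers the fallback are all hypothetical, and the review does not state the exact fallback condition. The loss treats every other positive in the batch as a negative, which is the general idea behind Multiple Negatives Ranking Loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mnr_loss(anchors, positives, scale=20.0):
    """In-batch Multiple Negatives Ranking Loss (illustrative).

    anchors[i] (an instruction embedding) is paired with positives[i]
    (its API embedding); every other positives[j] in the batch serves
    as a negative. The scale value is an assumption.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[i]  # cross-entropy with target index i
    return total / len(anchors)


def top_k(query, items, k):
    """Return the k (score, name) pairs most similar to the query."""
    scored = [(cosine(query, vec), name) for name, vec in items.items()]
    return sorted(scored, reverse=True)[:k]


def select_demos(query, subtask_demos, api_demos, k=2, threshold=0.8):
    """Prefer subtask-level demos; fall back to API-level demos.

    The threshold-based fallback rule is an assumption for illustration.
    """
    hits = [(s, n) for s, n in top_k(query, subtask_demos, k) if s >= threshold]
    if hits:
        return "subtask", [name for _, name in hits]
    return "api", [name for _, name in top_k(query, api_demos, k)]
```

In this sketch, a query close to a stored subtask embedding returns subtask-level demonstrations, while a query that matches no subtask above the threshold returns the nearest API-level demonstrations instead.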

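Experimental results

Research questions

RQ1: How much does API retrieval improve relevance for task planning in a large API ecosystem?
RQ2: Does fine-tuning the LLM on domain-specific data improve the accuracy of task planning and API calling?
RQ3: Does adaptive demonstration retrieval help the model distinguish semantically similar APIs and improve final task completion?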

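Key findings

In the real-world scenario, the API Retriever achieves a Recall@5 of 84.64% and a Recall@10 of 98.47%.
With the base LLM, execution accuracy rises from 38.89% (no demonstrations) to 43.33% with the API Retriever and to 95.55% with the Demo Selector; the fine-tuned LLM plus API Retriever reaches 80%, and all components together reach 96.67%.
In the open-source scenario, the base LLM has an execution accuracy of 76.67%; with the API Retriever alone it drops to 53.3% because of task complexity, while the fine-tuned LLM plus API Retriever reaches 86.7%.
The best real-world performance (96.67%) comes from combining the fine-tuned LLM, the API Retriever, and the Demo Selector, which underlines the value of integrating the components.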
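
For readers unfamiliar with the Recall@K figures above, the following sketch shows one common way to compute the metric. It assumes that Recall@K is the share of ground-truth APIs appearing among the top K retrieved APIs, averaged over instructions; this review does not restate the paper's exact definition, and the function names and toy data are hypothetical.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant APIs that appear in the top-k ranked list."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(ranked[:k]) & relevant_set) / len(relevant_set)


def mean_recall_at_k(examples, k):
    """Average Recall@k over (ranked_list, relevant_list) pairs."""
    return sum(recall_at_k(r, g, k) for r, g in examples) / len(examples)


# Toy data (hypothetical): each pair is (retriever ranking, ground-truth APIs).
examples = [
    (["a", "b", "c"], ["a", "c"]),
    (["x", "y", "z"], ["z"]),
]
```

Raising K can only keep Recall@K the same or increase it, which is consistent with Recall@10 exceeding Recall@5 in the reported results.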


This review was generated by AI and reviewed by a human editor.