QUICK REVIEW

[Paper Review] Making Language Models Better Tool Learners with Execution Feedback

Shuofei Qiao, Honghao Gui|arXiv (Cornell University)|May 22, 2023

Topic Modeling | Cited by 8

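ONE-SENTENCE SUMMARY

TRICE is a two-stage framework that uses execution feedback to teach language models when and how to use external tools, improving selective tool use and reducing error propagation.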

ABSTRACT

Tools serve as pivotal interfaces that enable humans to understand and reshape the environment. With the advent of foundation models, AI systems can utilize tools to expand their capabilities and interact with the real world. Existing tool learning methodologies, encompassing supervised fine-tuning and prompt engineering approaches, often induce large language models to utilize tools indiscriminately, as complex tasks often exceed their own competencies. However, introducing tools for simple tasks, which the models themselves can readily resolve, can inadvertently propagate errors rather than enhance performance. This leads to the research question: can we teach language models when and how to use tools? To meet this need, we propose Tool leaRning wIth exeCution fEedback (TRICE), a two-stage end-to-end framework that enables the model to continually learn through feedback derived from tool execution, thereby learning when and how to use tools effectively. Experimental results, backed by further analysis, show that TRICE can make the large language model selectively use tools by improving the accuracy of tool usage while enhancing insufficient tool learning and mitigating excessive reliance on tools. Code is available at https://github.com/zjunlp/TRICE.

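RESEARCH MOTIVATION AND GOALS

Pose the motivating question: when do large language models actually need tools, and when do they not.
Propose a two-stage training framework that teaches selective tool use through execution feedback.
Build a data preparation pipeline that uses an LLM to generate tool-use labels where they are needed.
Show that execution feedback improves tool-use accuracy and reduces over-reliance on tools across multiple tasks and backbone models.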
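The labeling step in the data preparation pipeline can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes that a question "needs a tool" when the model's direct answer does not match the gold answer, and the names `Example`, `build_example`, `answer_without_tool`, and `propose_tool_call` are hypothetical placeholders (the last one stands in for the LLM labeler, which this review says is ChatGPT).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    question: str
    gold_answer: str
    target: str        # text the model is trained to produce
    needs_tool: bool   # pseudo-label: does this question call for a tool?


def build_example(
    question: str,
    gold_answer: str,
    answer_without_tool: Callable[[str], str],  # hypothetical: base model answering directly
    propose_tool_call: Callable[[str], str],    # hypothetical: LLM labeler writing a tool call
) -> Example:
    """Assumed rule: only ask the labeler for a tool call when the direct answer is wrong."""
    direct = answer_without_tool(question)
    if direct.strip().lower() == gold_answer.strip().lower():
        # The model already solves this one; train it to answer without a tool.
        return Example(question, gold_answer, target=gold_answer, needs_tool=False)
    # The model failed on its own; use the labeler's tool call as the training target.
    return Example(question, gold_answer, target=propose_tool_call(question), needs_tool=True)
```

With stub callables, an easy question ends up labeled as not needing a tool, while a question the stub model gets wrong receives the labeler's tool call as its target.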

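PROPOSED METHOD

Prepare the dataset, using pseudo-labels generated by ChatGPT to indicate when a tool is needed.
Stage I: behavior cloning, which imitates tool-use behavior through instruction tuning on tool-use data.
Stage II: reinforcement learning with execution feedback (RLEF), which reinforces the desired tool-use behavior using the results of tool execution.
Use a ranking loss to align model outputs with the best candidate responses, and a supervised fine-tuning loss to constrain the outputs.
Apply a reward-based strategy that scores candidate responses by answer correctness and by agreement with the gold response.
Evaluate on multiple backbone models and eight datasets covering four task types, in both single-tool and multi-tool settings.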
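To make the Stage II objective more concrete, here is a minimal sketch of how a reward, a pairwise ranking loss, and a supervised fine-tuning term could fit together. It is an illustration under stated assumptions, not the paper's implementation: the reward formula (exact-match correctness plus token overlap with the gold response), the hinge form of the ranking loss, the length-normalized log-probability score, and the weight `alpha` are all assumptions, and every function name is hypothetical.

```python
from typing import List, Sequence


def score_candidate(answer: str, response: str, gold_answer: str, gold_response: str) -> float:
    """Assumed reward: 1 for a correct answer, plus Jaccard token overlap with the gold response."""
    correct = 1.0 if answer.strip() == gold_answer.strip() else 0.0
    a, b = set(response.split()), set(gold_response.split())
    overlap = len(a & b) / len(a | b) if (a | b) else 0.0
    return correct + overlap


def sequence_score(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of a (non-empty) candidate under the model."""
    return sum(token_logprobs) / len(token_logprobs)


def ranking_loss(scores: Sequence[float], rewards: Sequence[float]) -> float:
    """Hinge penalty whenever a lower-reward candidate is scored above a higher-reward one."""
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rewards[i] < rewards[j]:
                loss += max(0.0, scores[i] - scores[j])
    return loss


def sft_loss(token_logprobs: Sequence[float]) -> float:
    """Negative log-likelihood of one candidate."""
    return -sum(token_logprobs)


def total_loss(
    candidate_logprobs: List[List[float]],
    rewards: Sequence[float],
    alpha: float = 1.0,  # assumed weighting between the two terms
) -> float:
    """Ranking term over all candidates plus an SFT term on the highest-reward candidate."""
    scores = [sequence_score(lp) for lp in candidate_logprobs]
    best = max(range(len(rewards)), key=lambda k: rewards[k])
    return alpha * ranking_loss(scores, rewards) + sft_loss(candidate_logprobs[best])
```

In this sketch the ranking term is zero once the model already orders candidates by reward, so the gradient signal would come only from mis-ordered pairs and from the SFT term on the best candidate.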

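EXPERIMENTAL RESULTS

Research questions

RQ1: When should LLMs call tools across different tasks without over-relying on them?
RQ2: Does execution feedback improve tool-use accuracy and help the model learn to use tools selectively?
RQ3: To what extent does TRICE training generalize to unseen datasets and tools?
RQ4: How does two-stage training contribute to stable and effective tool learning?

Main findings

TRICE achieves selective tool use and outperforms prompt-based baselines across multiple tasks and backbone models.
Stage I (behavior cloning) lays the foundation for tool-use ability, while Stage II (RLEF) improves selectivity and mitigates over-reliance on tools.
TRICE-mix (multi-task training) outperforms TRICE-split (per-task training) across multiple backbone models, achieving state-of-the-art performance.
TRICE improves generalization to unseen tools and datasets, so the model handles tools better in new scenarios.
Execution feedback helps reduce error propagation and addresses the insufficient tool learning observed after Stage I.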

This review was generated by AI and checked by a human editor.