QUICK REVIEW

[Paper Review] ToolQA: A Dataset for LLM Question Answering with External Tools

Yuchen Zhuang, Yue Yu|arXiv (Cornell University)|Jun 23, 2023
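Topic Modeling | Cited by 39
One-Sentence Summary

ToolQA is a QA benchmark that isolates LLMs' use of external tools across 8 domains and 13 tools, revealing the strengths and weaknesses of tool-augmented models on easy and hard questions.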

ABSTRACT

Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks, but they still suffer from challenges such as hallucination and weak numerical reasoning. To overcome these challenges, external tools can be used to enhance LLMs' question-answering abilities. However, current evaluation methods do not distinguish between questions that can be answered using LLMs' internal knowledge and those that require external information through tool use. To address this issue, we introduce a new dataset called ToolQA, which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. Our development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions. Importantly, we strive to minimize the overlap between our benchmark data and LLMs' pre-training data, enabling a more precise evaluation of LLMs' tool-use reasoning abilities. We conducted an in-depth diagnosis of existing tool-use LLMs to highlight their strengths, weaknesses, and potential improvements. Our findings set a new benchmark for evaluating LLMs and suggest new directions for future advancements. Our data and code are freely available to the broader scientific community on GitHub.

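Motivation and Objectives

  • Enable a robust evaluation of LLMs' ability to answer questions with external tools, separating tool use from recall of internal knowledge.
  • Provide a scalable, automated data generation pipeline that minimizes manual annotation and produces tool-dependent QA data.
  • Curate diverse reference corpora and a tool suite covering text, tabular data, graphs, and code execution.
  • Measure the baseline performance of standard and tool-augmented LLMs and analyze error patterns to guide future improvements.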

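Proposed Method

  • Automated three-phase dataset construction: reference data collection, human-guided question generation, and programmatic answer generation.
  • Thirteen specialized tools covering text retrieval, database operations, mathematical computation, graph queries, and code interpretation.
  • Template-based question generation, guided by human validation, which ensures that questions require tool use rather than relying on the reference corpus alone.
  • Programmatic answer generation through predefined tool operators and tool chains, which derive the correct answers from the reference data.
  • Open-ended evaluation that scores the correctness of the final answer, regardless of the specific tool chain used.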
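The template-based question generation and programmatic answer generation described above can be pictured with a small sketch. Everything in it is an illustrative assumption rather than ToolQA's actual code: the `FLIGHTS` rows, the operator names `filter_rows`, `get_column`, and `average`, and the `Template` class are hypothetical stand-ins for a reference corpus, tool operators, and a question template.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical reference data: a tiny stand-in for one tabular corpus.
FLIGHTS = [
    {"flight": "AA100", "date": "2022-01-05", "delay_min": 12},
    {"flight": "AA100", "date": "2022-01-06", "delay_min": 30},
    {"flight": "UA200", "date": "2022-01-05", "delay_min": 0},
]


# Hypothetical tool operators (not the benchmark's real tool names).
def filter_rows(rows, key, value):
    """Keep only the rows whose `key` column equals `value`."""
    return [row for row in rows if row[key] == value]


def get_column(rows, key):
    """Project a single column out of a list of rows."""
    return [row[key] for row in rows]


def average(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


@dataclass
class Template:
    text: str                                   # question with named slots
    answer_fn: Callable[[list, dict], object]   # chain of tool operators


# One template pairs a question pattern with the operator chain that answers it.
AVG_DELAY = Template(
    text="What was the average delay in minutes of flight {flight}?",
    answer_fn=lambda rows, slots: average(
        get_column(filter_rows(rows, "flight", slots["flight"]), "delay_min")
    ),
)


def instantiate(template, rows, slots):
    """Fill the template slots and compute the gold answer from the data."""
    return {
        "question": template.text.format(**slots),
        "answer": template.answer_fn(rows, slots),
    }


qa = instantiate(AVG_DELAY, FLIGHTS, {"flight": "AA100"})
print(qa["question"])
print(qa["answer"])
```

Because the gold answer is computed from the reference data by the operator chain, no manual answer annotation is needed for each instantiated question, which is the property the pipeline relies on for scalability.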
Figure 1: Pre-trained on vast range of corpus, LLMs possess extensive knowledge, which may overlap with evaluation data. This overlap poses a significant challenge to current evaluation methods, as it becomes difficult to discern whether the model is merely recalling pre-trained information or genui

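Experimental Results

Research Questions

  • RQ1: Can LLMs reliably answer questions that require external tools, rather than relying on internal pre-trained knowledge?
  • RQ2: How proficient are current tool-augmented LLMs at building and executing multi-step tool chains for complex queries?
  • RQ3: What are the main error patterns when LLMs use external tools for QA, and how do they differ between easy and hard questions?

Key Findings

  • Tool-augmented LLMs outperform purely internal reasoning on ToolQA's easy questions but still perform poorly on hard questions.
  • ReAct-based methods are the strongest of the baselines, yet their success rate on hard questions remains low (e.g., 8.2% on average for hard questions).
  • ChatGPT and step-by-step reasoning prompts perform poorly on ToolQA, underscoring the need for explicit tool use.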
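  • The main error types include tool argument errors, incorrect data sources, and innovation and hallucination, especially on harder tasks.
  • Hard questions require more complex tool composition and reasoning, highlighting current limitations in planning and executing tool use.
  • ToolQA draws its data from out-of-scope sources and carefully reduces overlap with LLM pre-training data to enable fair evaluation.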
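Since evaluation looks only at the final answer, per-difficulty success rates like those quoted above can be computed without inspecting the tool chain. The following is a minimal sketch under stated assumptions: the `normalize` rule (lowercasing and whitespace collapsing), the record fields, and the sample records are all hypothetical, and the benchmark's actual matching rule may differ.

```python
from collections import defaultdict


def normalize(answer):
    """Assumed normalization: stringify, lowercase, collapse whitespace."""
    return " ".join(str(answer).strip().lower().split())


def success_rates(records):
    """Fraction of exactly matching final answers per difficulty level."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        level = record["difficulty"]
        totals[level] += 1
        # Only the final answer is compared; the tool chain is ignored.
        if normalize(record["prediction"]) == normalize(record["gold"]):
            correct[level] += 1
    return {level: correct[level] / totals[level] for level in totals}


# Hypothetical model outputs, invented purely for illustration.
records = [
    {"difficulty": "easy", "gold": "21.0", "prediction": " 21.0 "},
    {"difficulty": "easy", "gold": "AA100", "prediction": "aa100"},
    {"difficulty": "hard", "gold": "3", "prediction": "4"},
    {"difficulty": "hard", "gold": "Boston", "prediction": "boston"},
]

rates = success_rates(records)
print(rates)
```

Scoring only the final answer keeps the metric agnostic to which tools a model chose, at the cost of not rewarding partially correct tool chains.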
Figure 2: ToolQA, aiming to faithfully evaluate LLMs’ abilities to use external tools, curates data through three phases: (a) Reference Data Collection; (b) Human-Guided Question Generation; and (c) Programmatic Answer Generation.


This review was generated by AI and reviewed by human editors.