QUICK REVIEW

[Paper Review] ToolQA: A Dataset for LLM Question Answering with External Tools

Yuchen Zhuang, Yue Yu|arXiv (Cornell University)|Jun 23, 2023

Topic Modeling | Cited by 39

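One-Line Summary

ToolQA is a QA benchmark that isolates external tool use across 8 domains and 13 tools, revealing the strengths and limitations of tool-augmented models on easy and hard questions.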

ABSTRACT

Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks, but they still suffer from challenges such as hallucination and weak numerical reasoning. To overcome these challenges, external tools can be used to enhance LLMs' question-answering abilities. However, current evaluation methods do not distinguish between questions that can be answered using LLMs' internal knowledge and those that require external information through tool use. To address this issue, we introduce a new dataset called ToolQA, which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. Our development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions. Importantly, we strive to minimize the overlap between our benchmark data and LLMs' pre-training data, enabling a more precise evaluation of LLMs' tool-use reasoning abilities. We conducted an in-depth diagnosis of existing tool-use LLMs to highlight their strengths, weaknesses, and potential improvements. Our findings set a new benchmark for evaluating LLMs and suggest new directions for future advancements. Our data and code are freely available to the broader scientific community on GitHub.

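Motivation and Objectives

Promote robust evaluation of LLMs' ability to answer questions using external tools, separating tool use from recall of internal knowledge.
Provide a scalable, automated data generation pipeline that creates tool-dependent QA data with minimal human labeling.
Assemble diverse reference corpora and tools covering text, tabular data, graphs, and code execution.
Measure baseline performance of standard and tool-augmented LLMs and analyze error modes to guide future improvements.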

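Proposed Method

Automated three-phase dataset construction: reference data collection, human-guided question generation, and programmatic answer generation.
Design of 13 specialized tools covering text retrieval, database operations, numerical computation, graph queries, and code interpretation.
Template-based question generation guided by human validation, so that questions can only be answered by using tools over the reference corpora.
Programmatic answer generation with predefined tool operators and tool chains, producing exact answers from the reference data.
Open-ended evaluation that focuses on final-answer correctness rather than on use of a specific tool chain.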
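
The template-based question generation and programmatic answer generation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the toy flight table, the template string, and the `filter_rows` and `get_value` operators are hypothetical names invented for this example.

```python
# Hypothetical toy reference data; the real ToolQA corpora are far larger.
FLIGHTS = [
    {"flight": "AA100", "date": "2022-01-03", "origin": "JFK", "delay_min": 12},
    {"flight": "UA200", "date": "2022-01-03", "origin": "SFO", "delay_min": 45},
    {"flight": "AA100", "date": "2022-01-04", "origin": "JFK", "delay_min": 0},
]

# Hypothetical question template; slots are filled from the reference data.
TEMPLATE = "How many minutes was flight {flight} delayed on {date}?"


def filter_rows(rows, **conditions):
    """Assumed tool operator: keep rows matching every column=value condition."""
    return [r for r in rows if all(r[k] == v for k, v in conditions.items())]


def get_value(rows, column):
    """Assumed tool operator: read one column from a single matching row."""
    if len(rows) != 1:
        raise ValueError(f"expected exactly one row, got {len(rows)}")
    return rows[0][column]


def generate_qa(rows):
    """Instantiate the template per row and compute its answer with a tool chain."""
    pairs = []
    for row in rows:
        question = TEMPLATE.format(flight=row["flight"], date=row["date"])
        # Tool chain: filter the table, then read the target column.
        matched = filter_rows(rows, flight=row["flight"], date=row["date"])
        answer = str(get_value(matched, "delay_min"))
        pairs.append({"question": question, "answer": answer})
    return pairs


qa_pairs = generate_qa(FLIGHTS)
print(qa_pairs[1])
```

Because the answer is computed from the reference data by the same operators a tool-using model would need, the question cannot be answered from memorized knowledge alone, which is the property the benchmark aims for.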

Figure 1: Pre-trained on vast range of corpus, LLMs possess extensive knowledge, which may overlap with evaluation data. This overlap poses a significant challenge to current evaluation methods, as it becomes difficult to discern whether the model is merely recalling pre-trained information or genui

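Experimental Results

Research Questions

RQ1: Can LLMs reliably answer questions that require external tools, without relying on internal pre-trained knowledge?
RQ2: How capable are current tool-augmented LLMs at composing and executing multi-step tool chains for complex queries?
RQ3: What are the main error modes when LLMs use external tools for QA, and how do they differ between easy and hard questions?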

Key Findings

Tool-augmented LLMs outperform purely internal reasoning on ToolQA’s easy questions but still struggle with hard questions.
ReAct-based approaches show the strongest performance among the baselines, yet success rates on hard questions remain low (e.g., 8.2% on average).
ChatGPT and chain-of-thought prompts perform poorly on ToolQA, underscoring the need for explicit tool use.
Main error types include incorrect tool arguments, incorrect data sources, and innovation and hallucination, especially on harder tasks.
Hard questions require more complex tool compositions and reasoning, highlighting current limits in tool-use planning and execution.
ToolQA data is drawn from sources chosen to minimize overlap with LLM pre-training data, supporting a fairer evaluation of tool use.
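
The easy versus hard comparison above rests on a per-difficulty success rate over final answers. A minimal sketch of that computation follows. It assumes normalized exact match as the correctness criterion, which is an assumption of this example rather than a confirmed detail of the paper, and the records are made up for illustration, not results from ToolQA.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(str(text).lower().split())


def success_rate_by_level(records):
    """Return {level: fraction of records whose prediction matches the gold answer}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["level"]] += 1
        if normalize(rec["prediction"]) == normalize(rec["gold"]):
            correct[rec["level"]] += 1
    return {level: correct[level] / total[level] for level in total}


# Made-up records for illustration only; these are not results from the paper.
records = [
    {"level": "easy", "prediction": "45", "gold": "45"},
    {"level": "easy", "prediction": " JFK ", "gold": "jfk"},
    {"level": "hard", "prediction": "12.5", "gold": "13.0"},
    {"level": "hard", "prediction": "3", "gold": "3"},
]
rates = success_rate_by_level(records)
print(rates)
```

Scoring only the final answer, and not the sequence of tool calls, matches the open-ended evaluation described in the method: any tool chain that reaches the right answer counts as a success.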

Figure 2: ToolQA, aiming to faithfully evaluate LLMs’ abilities to use external tools, curates data through three phases: (a) Reference Data Collection; (b) Human-Guided Question Generation; and (c) Programmatic Answer Generation.

This review was generated by AI and checked by a human editor.