QUICK REVIEW

[Paper Review] Counterfactually Auditable Lifecycle Certification for Autonomous Agents

Yujia Qin|arXiv (Cornell University)|Jul 31, 2023

Natural Language Processing Techniques|Cited by 63

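One-line summary

Introduces ToolBench and provides tool-use instruction tuning for open-source LLMs, the DFSDT reasoning strategy, and the automatic evaluator ToolEval, achieving tool-use performance competitive with closed-source models and robust generalization to unseen APIs.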

ABSTRACT

Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability in an out-of-distribution tool-use dataset: APIBench.

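Motivation and objectives

Enable open-source LLMs to use external APIs proficiently in realistic, multi-tool scenarios.
Build a scalable, automatic data-generation pipeline (ToolBench) on top of real-world RESTful APIs.
Develop a depth-first search-based decision tree (DFSDT) to improve planning and reasoning for tool use.
Provide an automatic evaluation framework (ToolEval) that measures tool-use capability.
Demonstrate generalization to unseen APIs and to an out-of-distribution tool-use benchmark.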

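Proposed method

Build ToolBench by collecting 16,464 RESTful APIs spanning 49 categories from RapidAPI.
Use ChatGPT to generate diverse single-tool and multi-tool instructions, and annotate solution paths with a DFSDT-driven process.
Annotate instruction-solution pairs through multi-round reasoning that calls an API at each step and records the thought, the selected API, and its parameters.
Fine-tune LLaMA-2 (7B) on ToolBench data to obtain ToolLLaMA, with the context length extended to accommodate long API responses.
Train a neural API retriever that recommends relevant APIs for a given instruction, improving retrieval accuracy.
Develop ToolEval, an automatic evaluator with pass rate and win rate metrics that assess tool-use performance and solution-path quality.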
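
The depth-first search with backtracking behind DFSDT can be illustrated with a minimal sketch. This is not the authors' implementation: the function `dfs_solution_path`, the callbacks `propose` and `is_final`, the `max_depth` limit, and the toy API names are all hypothetical, and a hard-coded lookup table stands in for the LLM that would normally propose the next API call.

```python
from typing import Callable, Dict, List, Optional, Sequence


def dfs_solution_path(
    start: str,
    propose: Callable[[Sequence[str]], List[str]],
    is_final: Callable[[Sequence[str]], bool],
    max_depth: int = 5,
) -> Optional[List[str]]:
    """Return the first path accepted by is_final, or None if none is found."""

    def visit(path: List[str]) -> Optional[List[str]]:
        if is_final(path):
            return path
        if len(path) >= max_depth:
            return None  # abandon this branch and backtrack
        # Try candidate next steps in the order the proposer prefers them.
        for step in propose(path):
            found = visit(path + [step])
            if found is not None:
                return found
        return None  # every child failed, so the caller tries a sibling

    return visit([start])


# Toy stand-in for the LLM (assumption): fixed children per step.
TOY_CHILDREN: Dict[str, List[str]] = {
    "start": ["search_flights", "get_weather"],
    "search_flights": ["book_hotel"],  # leads to a dead end in this toy tree
    "get_weather": ["finish"],
}


def toy_propose(path: Sequence[str]) -> List[str]:
    return TOY_CHILDREN.get(path[-1], [])


def toy_is_final(path: Sequence[str]) -> bool:
    return path[-1] == "finish"


path = dfs_solution_path("start", toy_propose, toy_is_final)
print(path)
```

In this toy tree the search first follows `search_flights`, hits a dead end, backtracks, and then succeeds through `get_weather`. A single-pass strategy that cannot backtrack would have stopped at the dead end.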

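Experimental results

Research questions

RQ1: How effectively can open-source LLMs learn to master real-world APIs in single-tool and multi-tool settings?
RQ2: Can the neural API retriever effectively identify relevant APIs for a given instruction from a large pool?
RQ3: Does the DFSDT reasoning strategy improve planning, exploration, and the final success rate compared with ReAct?
RQ4: How well does ToolLLaMA generalize to unseen APIs and to an out-of-distribution tool-use dataset?
RQ5: Is automatic evaluation (ToolEval) a reliable proxy for human judgment in tool-use settings?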
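
One simple way to approach RQ5 is to compare per-instruction verdicts from the automatic evaluator with human verdicts. The sketch below assumes each instruction receives a boolean pass/fail label from both sources. The function names and the sample labels are hypothetical and are not taken from the paper's results.

```python
from typing import Sequence


def pass_rate(labels: Sequence[bool]) -> float:
    """Fraction of instructions judged as successfully completed."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(labels) / len(labels)


def agreement_rate(auto: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of instructions where both judges give the same verdict."""
    if len(auto) != len(human) or not auto:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == h for a, h in zip(auto, human))
    return matches / len(auto)


# Made-up labels for four instructions (illustration only).
auto_labels = [True, True, False, True]
human_labels = [True, False, False, True]

print(pass_rate(auto_labels))
print(agreement_rate(auto_labels, human_labels))
```

A high agreement rate on a held-out sample would support using the automatic evaluator as a proxy. A low one would suggest that its pass rates should be read with caution.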

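Key findings

ToolLLaMA, fine-tuned on ToolBench and used with DFSDT, achieves performance competitive with ChatGPT on tool-use tasks and comes close to GPT-4.
The DFSDT strategy substantially improves pass rate and win rate over ReAct, most notably on difficult multi-tool instructions.
The neural API retriever substantially improves API selection accuracy and in some cases even outperforms the ground-truth API set.
ToolLLaMA generalizes robustly to unseen APIs and to an out-of-distribution dataset (APIBench), matching or exceeding baselines in several settings.
In practice, using the API retriever (top-5 APIs) can outperform the oracle API set, which highlights the retriever's ability to broaden the selection of useful tools.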
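
The top-5 retrieval step mentioned above can be sketched as a similarity ranking over embeddings. This is an illustration under assumptions, not the paper's retriever: the embedding model is not specified here, so the vectors are hand-made toy values, and `cosine`, `top_k_apis`, and the API names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_apis(
    query_vec: Sequence[float],
    api_vecs: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[str]:
    """Return the names of the k APIs most similar to the query embedding."""
    ranked = sorted(
        api_vecs,
        key=lambda name: cosine(query_vec, api_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (assumption): a real system would produce these with a model.
toy_api_vecs = {
    "weather_forecast": [0.9, 0.1, 0.0],
    "flight_search": [0.1, 0.9, 0.0],
    "currency_convert": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.3, 0.0]  # stands in for an embedded user instruction

print(top_k_apis(toy_query, toy_api_vecs, k=2))
```

Returning several candidates rather than one gives the downstream model room to pick tools that a fixed reference set might not include.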

This review was generated by AI and checked by a human editor.