QUICK REVIEW

[Paper Review] ToolGym: an Open-world Tool-using Environment for Scalable Agent Testing and Data Curation

Ziqiao Xi, Shuang Liang|arXiv (Cornell University)|Jan 9, 2026

Software Testing and Debugging Techniques | Citations: 0

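One-line summary

ToolGym provides a scalable open-world environment of 5,571 tools spanning 204 apps, together with a task creation engine and a state controller, for evaluating and training tool-using LLM agents on long-horizon tasks and for robust data curation.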

ABSTRACT

Tool-using LLM agents still struggle in open-world settings with large tool pools, long-horizon objectives, wild constraints, and unreliable tool states. For scalable and realistic training and testing, we introduce an open-world tool-using environment, built on 5,571 format unified tools across 204 commonly used apps. It includes a task creation engine that synthesizes long-horizon, multi-tool workflows with wild constraints, and a state controller that injects interruptions and failures to stress-test robustness. On top of this environment, we develop a tool select-then-execute agent framework with a planner-actor decomposition to separate deliberate reasoning and self-correction from step-wise execution. Comprehensive evaluation of state-of-the-art LLMs reveals the misalignment between tool planning and execution abilities, the constraint following weakness of existing LLMs, and DeepSeek-v3.2's strongest robustness. Finally, we collect 1,170 trajectories from our environment to fine-tune LLMs, achieving superior performance to baselines using 119k samples, indicating the environment's value as both a realistic benchmark and a data engine for tool-using agents. Our code and data will be publicly released.

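Motivation and objectives

Motivates the need for realistic, scalable evaluation of tool-using LLM agents in open-world settings with many tools and long workflows.
Introduces ToolGym as a unified environment with a large curated tool base, task synthesis under wild constraints, and a state controller that simulates failures.
Proposes a planner-actor agent framework that separates reasoning from execution in long-horizon tasks to improve robustness.
Demonstrates that models can be fine-tuned efficiently on ToolGym data, improving performance with limited data.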

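Proposed method

Curate 5,571 tools covering 204 applications from 276 MCP servers into a unified MCP format.
Develop a task creation engine that synthesizes long-horizon, multi-tool workflows under wild constraints.
Introduce a state controller that injects interruptions and failures (tool-level, state-level, and constraint-level) during execution.
Implement a planner-actor agent framework that separates deliberate planning from step-wise execution (the Planner provides guidance; the Actor executes).
Evaluate multiple LLMs using an ensemble-style loop that integrates tool retrieval, planning, and execution, with task success and constraint adherence scored by an LLM-judge protocol.
Build an automated data pipeline for LLM fine-tuning, fine-tune on 1,170 ToolGym trajectories, and compare against baselines trained on larger datasets.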
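
To make the interplay between the state controller and the planner-actor split more concrete, here is a minimal sketch. It is not ToolGym's implementation: the names `StateController`, `ToolFailure`, `plan`, `run_agent`, and the toy `search` / `summarize` tools are all hypothetical, the planner is a hard-coded stub where the paper uses an LLM, and only tool-level failures are modelled (state-level and constraint-level injections are omitted).

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class ToolFailure(Exception):
    """Hypothetical error raised when a tool-level failure is injected."""


@dataclass
class StateController:
    """Hypothetical wrapper that fails a tool call with a fixed probability."""
    failure_rate: float
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    injected: int = 0

    def call(self, tool: Callable[[str], str], arg: str) -> str:
        # Decide whether to inject a failure before the tool actually runs.
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise ToolFailure(f"injected failure for {tool.__name__}")
        return tool(arg)


def search(query: str) -> str:
    """Toy tool standing in for a real MCP tool."""
    return f"results for {query}"


def summarize(text: str) -> str:
    """Toy tool standing in for a real MCP tool."""
    return f"summary of {text}"


def plan(goal: str) -> List[str]:
    """Stub planner: returns an ordered list of tool names (assumed fixed here)."""
    return ["search", "summarize"]


def run_agent(
    goal: str,
    tools: Dict[str, Callable[[str], str]],
    controller: StateController,
    max_retries: int = 3,
) -> List[str]:
    """Actor loop: execute each planned step, retrying on injected failures."""
    trace: List[str] = []
    state = goal
    for step in plan(goal):
        for _ in range(max_retries):
            try:
                state = controller.call(tools[step], state)
                trace.append(f"{step}:ok")
                break
            except ToolFailure:
                trace.append(f"{step}:retry")
        else:
            # All retries failed: stop and report where the run gave up.
            trace.append(f"{step}:gave_up")
            return trace
    return trace
```

Calling `run_agent("find papers", {"search": search, "summarize": summarize}, StateController(failure_rate=0.3))` returns a trace of per-step outcomes. Raising `failure_rate` is one simple way to stress-test how often a retry policy recovers, which is the kind of robustness question the state controller is described as probing.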

Figure 1: The overall framework of ToolGym. The pipeline begins with curating real-world MCP tools and synthesizing tasks with wild constraints (Left). The agent employs a Planner–Actor architecture to decompose long-horizon goals, where the Actor interacts with the environment via a State Controll

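Experimental results

Research questions

RQ1: Can current LLMs plan and execute long-horizon workflows using a large, open tool library?
RQ2: What are the dominant failure modes (planning vs. execution vs. constraint adherence) when operating in open-world tool environments?
RQ3: Does the planner-actor decomposition improve robustness and success rate on tool-based tasks?
RQ4: How efficiently can ToolGym trajectory data improve downstream LLM tuning for tool use?
RQ5: Is ToolGym's automated data generation superior to human-annotated data for tool-using agents?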

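Key findings

LLMs show strong planning ability but weak execution alignment, leaving a gap in task success.
The main bottleneck for current models is constraint adherence rather than tool invocation.
DeepSeek-v3.2 demonstrates the strongest robustness, with high resilience and adaptability under interruptions.
Higher tool usage does not necessarily yield a higher success rate; reasoning failures can be the cause.
Fine-tuning on 1,170 ToolGym trajectories outperforms baselines trained on 119k samples.
ToolGym serves as an effective, data-efficient engine for both evaluation and data curation.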
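
As a rough sense of scale for the data-efficiency claim, the short calculation below compares the two training-set sizes. It assumes "119k" means exactly 119,000, and it treats one ToolGym trajectory and one baseline sample as comparable units, which the source does not establish (a trajectory may contain many steps).

```python
toolgym_trajectories = 1_170
baseline_samples = 119_000  # assumption: "119k" taken as exactly 119,000

# Fraction of the baseline data volume used by the ToolGym fine-tuning set.
ratio = toolgym_trajectories / baseline_samples
# How many times larger the baseline set is.
reduction = baseline_samples / toolgym_trajectories

print(f"ToolGym set is {ratio:.2%} of the baseline size")  # 0.98%
print(f"Baseline set is about {reduction:.0f}x larger")    # about 102x
```

Under these assumptions the ToolGym set is roughly one hundredth the size of the baseline training data.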

Figure 2: Five-dimensional personality radar charts of different MCP-based agents.

This review was generated by AI and checked by a human editor.