QUICK REVIEW

[Paper Review] Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data

Emre Can Acikgoz, Cheng Qian|arXiv (Cornell University)|Feb 24, 2026

Reinforcement Learning in Robotics | Cited by 0

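One-Sentence Summary

Tool-R0 trains general-purpose tool-calling agents from scratch with self-play reinforcement learning between a Generator and a Solver, achieving substantial gains without any human data and outperforming supervised baselines.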

ABSTRACT

Large language models (LLMs) are becoming the foundation for autonomous agents that can use tools to solve complex tasks. Reinforcement learning (RL) has emerged as a common approach for injecting such agentic capabilities, but typically under tightly controlled training setups. It often depends on carefully constructed task-solution pairs and substantial human supervision, which creates a fundamental obstacle to open-ended self-evolution toward superintelligent systems. In this paper, we propose the Tool-R0 framework for training general-purpose tool-calling agents from scratch with self-play RL, under a zero-data assumption. Initialized from the same base LLM, Tool-R0 co-evolves a Generator and a Solver with complementary rewards: one proposes targeted challenging tasks at the other's competence frontier and the other learns to solve them with real-world tool calls. This creates a self-evolving cycle that requires no pre-existing tasks or datasets. Evaluations on different tool-use benchmarks show that Tool-R0 yields a 92.5% relative improvement over the base model and surpasses fully supervised tool-calling baselines under the same setting. Our work further provides empirical insights into self-play LLM agents by analyzing co-evolution, curriculum dynamics, and scaling behavior.

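Motivation and Objectives

Motivate learning tool calling without human data, given the scalability limits of curated datasets.
Introduce a self-evolving RL framework with two roles, a Generator and a Solver.
Design grounded, controllable task generation and a difficulty-aware curriculum.
Demonstrate zero-data tool learning across model scales and architectures.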

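Proposed Method

Initialize two co-evolving roles, a Generator and a Solver, from the same base LLM.
Ground task generation in a controllable specification covering domain, context, tools, and answer.
Train the Generator with GRPO to produce verifiable, challenging tasks, using a multi-component reward that combines format, validity, and curriculum signals.
Build a filtered dataset for Solver training from the Generator's outputs through deduplication, cross-validation, and difficulty-based grouping.
Train the Solver to predict tool calls from a query and a tool menu, using reasoning prompts and an output structure that supports automatic verification.
Evaluate Tool-R0 on five tool-calling benchmarks using AST-based matching, and analyze curriculum dynamics, co-evolution, and scaling.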
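
To illustrate what AST-based matching of tool calls can look like, here is a minimal sketch using Python's standard `ast` module. It is not the paper's evaluation code: the function names `parse_tool_call` and `calls_match`, the `name(arg=value, ...)` call syntax, and the restriction to keyword arguments with literal values are all assumptions made for this example.

```python
import ast


def parse_tool_call(call_text: str) -> tuple[str, dict]:
    """Parse 'name(arg=value, ...)' into (name, kwargs).

    Hypothetical helper for illustration; assumes keyword-only,
    literal-valued arguments.
    """
    node = ast.parse(call_text.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if node.args:
        raise ValueError("positional arguments are not handled in this sketch")
    name = ast.unparse(node.func)  # also covers dotted names like 'maps.search'
    kwargs = {}
    for kw in node.keywords:
        if kw.arg is None:
            raise ValueError("**kwargs expansion is not handled in this sketch")
        # literal_eval only accepts literals, so arbitrary code is rejected.
        kwargs[kw.arg] = ast.literal_eval(kw.value)
    return name, kwargs


def calls_match(predicted: str, reference: str) -> bool:
    """True if both calls name the same tool with equal keyword arguments.

    Comparison is structural: argument order and quoting style do not matter.
    Unparseable predictions count as mismatches.
    """
    try:
        return parse_tool_call(predicted) == parse_tool_call(reference)
    except (SyntaxError, ValueError):
        return False
```

Comparing parsed structures rather than raw strings means that `get_weather(city="Paris", unit="C")` and `get_weather(unit='C', city='Paris')` count as the same call, while a malformed prediction is simply scored as wrong.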

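Experimental Results

Research Questions

RQ1 Can Tool-R0 enable a base LLM to learn complex tool-calling skills from scratch through self-play?
RQ2 How does model scale affect Tool-R0's tool-calling performance?
RQ3 Is Tool-R0 robust across different base model families (e.g., Qwen and Llama)?
RQ4 How does Tool-R0 compare with models trained under supervision on human data?
RQ5 How do self-play dynamics, architectural separation, and curriculum design affect learning?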

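Key Findings

Tool-R0 achieves a 92.52% average relative improvement over the base model across the benchmarks.
With Tool-R0 training, the 0.5B model exceeds the average accuracy of the 1.5B baseline model, and the 1.5B model in turn exceeds the 3B baseline model.
Tool-R0 improves both the Qwen and Llama families, indicating model-agnostic gains across architectures.
With zero curated data, Tool-R0 surpasses supervised baselines trained on thousands of human-annotated examples (47.84% versus 46.06% on average, in the comparison with ToolRL).
Separate Generator and Solver parameters are key to stable co-evolution in high-entropy tool-use settings.
Freezing the Generator or removing the curriculum/difficulty reward degrades Solver performance, confirming the need for active Generator learning and adaptive rewards.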
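
To make the idea of a curriculum/difficulty reward more concrete, the sketch below shows one possible shape for a Generator reward built from format, validity, and curriculum components. Only the three component names come from the summary above. The weights, the gating on format and validity, the triangular curriculum signal that peaks at a 50% Solver pass rate, and the names `RewardWeights`, `curriculum_signal`, and `generator_reward` are assumptions for illustration, not the paper's actual formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the source does not state the actual values.
    format: float = 0.2
    validity: float = 0.3
    curriculum: float = 0.5


def curriculum_signal(solver_pass_rate: float) -> float:
    """Assumed difficulty signal: 1.0 at a 50% pass rate, 0.0 at 0% or 100%.

    Tasks the Solver always solves (too easy) or never solves (too hard)
    earn nothing; tasks near its competence frontier earn the most.
    """
    p = min(max(solver_pass_rate, 0.0), 1.0)  # clamp to [0, 1]
    return 1.0 - abs(2.0 * p - 1.0)


def generator_reward(
    format_ok: bool,
    task_valid: bool,
    solver_pass_rate: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Combine the three components into one scalar reward (illustrative)."""
    if not format_ok:
        return 0.0  # unparseable output gets no credit at all
    reward = weights.format
    if task_valid:
        # Curriculum credit is only granted for tasks that pass validation.
        reward += weights.validity
        reward += weights.curriculum * curriculum_signal(solver_pass_rate)
    return reward
```

Under these assumptions, `solver_pass_rate` would be estimated by sampling several Solver attempts per generated task. Dropping the curriculum term leaves a reward that cannot tell trivial tasks from frontier ones, which is the kind of ablation the last finding refers to.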


This review was generated by AI and reviewed by human editors.