QUICK REVIEW

[Paper Review] Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data

Emre Can Acikgoz, Cheng Qian | arXiv (Cornell University) | 2026-02-24

Reinforcement Learning in Robotics | Citations: 0

One-Line Summary

Tool-R0 trains general-purpose tool-calling agents from scratch through self-play RL between a Generator and a Solver, achieving large improvements without any human data and outperforming supervised baselines.

ABSTRACT

Large language models (LLMs) are becoming the foundation for autonomous agents that can use tools to solve complex tasks. Reinforcement learning (RL) has emerged as a common approach for injecting such agentic capabilities, but typically under tightly controlled training setups. It often depends on carefully constructed task-solution pairs and substantial human supervision, which creates a fundamental obstacle to open-ended self-evolution toward superintelligent systems. In this paper, we propose the Tool-R0 framework for training general purpose tool-calling agents from scratch with self-play RL, under a zero-data assumption. Initialized from the same base LLM, Tool-R0 co-evolves a Generator and a Solver with complementary rewards: one proposes targeted challenging tasks at the other's competence frontier and the other learns to solve them with real-world tool calls. This creates a self-evolving cycle that requires no pre-existing tasks or datasets. Evaluation on different tool-use benchmarks shows that Tool-R0 yields 92.5% relative improvement over the base model and surpasses fully supervised tool-calling baselines under the same setting. Our work further provides empirical insights into self-play LLM agents by analyzing co-evolution, curriculum dynamics, and scaling behavior.

Research Motivation and Goals

Motivate learning tool calling without human data, given the scalability limits of curated data.
Introduce a self-evolving RL framework built around two roles, a Generator and a Solver.
Enable grounded, controllable task generation and difficulty-aware curricula.
Demonstrate zero-data tool learning across model scales and architectures.

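Proposed Method

Initialize two co-evolving roles, a Generator and a Solver, from the same base LLM.
Ground task generation in a domain-controlled specification (domain, context, tools, gold answer).
Define a multi-component reward covering format, validity, and curriculum signals, and train the Generator with GRPO to produce verifiable, challenging tasks.
Build a curated dataset for Solver training from Generator outputs through deduplication, cross-verification, and difficulty-based ordering.
Train the Solver to predict tool calls from a query and a tool menu, using a reasoning prompt and an output structure that supports automatic verification.
Evaluate Tool-R0 on five tool-calling benchmarks with AST-based matching, and analyze curriculum dynamics, co-evolution, and scaling.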
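
To make the AST-based matching in the evaluation step concrete, here is a minimal Python sketch. It assumes tool calls are written as Python-style call strings with keyword arguments only; the helper names `parse_tool_call` and `calls_match` and the example tools are hypothetical, and the matchers used by the actual benchmarks may apply different or looser rules.

```python
import ast


def parse_tool_call(call_text):
    """Parse a call string such as 'get_weather(city="Paris")' into (name, kwargs).

    Assumption for this sketch: keyword arguments with literal values only.
    """
    tree = ast.parse(call_text.strip(), mode="eval")
    node = tree.body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if node.args:
        raise ValueError("positional arguments are not handled in this sketch")

    # ast.unparse (Python 3.9+) also covers dotted names such as 'maps.search'.
    name = ast.unparse(node.func)

    kwargs = {}
    for keyword in node.keywords:
        if keyword.arg is None:
            raise ValueError("**kwargs unpacking is not handled in this sketch")
        # literal_eval accepts an AST node and rejects anything that is not a literal.
        kwargs[keyword.arg] = ast.literal_eval(keyword.value)
    return name, kwargs


def calls_match(predicted, reference):
    """Return True when both strings name the same tool with equal arguments."""
    try:
        return parse_tool_call(predicted) == parse_tool_call(reference)
    except (SyntaxError, ValueError, TypeError):
        # Unparseable or unsupported predictions count as a mismatch.
        return False


if __name__ == "__main__":
    print(calls_match('get_weather(city="Paris", unit="C")',
                      "get_weather(unit='C', city='Paris')"))
    print(calls_match('get_weather(city="Paris")', 'get_weather(city="Rome")'))
```

Comparing parsed structures rather than raw strings means argument order and quote style do not affect the verdict, which is the usual reason for preferring AST matching over exact string matching.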

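Experimental Results

Research Questions

RQ1: Can Tool-R0 learn complex tool-calling skills from a base LLM through self-play?
RQ2: How does model size affect Tool-R0's tool-calling performance?
RQ3: Is Tool-R0 robust across different base model families (e.g., Qwen versus Llama)?
RQ4: How does Tool-R0 compare with supervised models trained on human data?
RQ5: How do self-play dynamics, architectural separation, and curriculum design affect learning?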

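Key Results

Tool-R0 achieves an average relative improvement of 92.52% over the base model across benchmarks.
With Tool-R0, the 0.5B model surpasses the 1.5B base model in average accuracy, and the 1.5B model surpasses the 3B base model.
Tool-R0 improves both the Qwen and Llama families, indicating that its benefit is architecture-agnostic.
With zero curated data, Tool-R0 outperforms supervised baselines trained on thousands of human-annotated examples (47.84% average versus 46.06% for ToolRL).
Separating Generator and Solver parameters proves important for stable co-evolution in high-entropy tool-use settings.
Freezing the Generator or removing the curriculum/difficulty reward lowers Solver performance, confirming the need for active Generator training and adaptive rewards.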
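
The curriculum/difficulty reward mentioned in the last finding can be illustrated with a small sketch. The review does not give the paper's actual reward formula, so the function below is a hypothetical stand-in: it assumes the Generator is rewarded according to how often the Solver succeeds on a proposed task, with the reward peaking when the Solver succeeds about half the time (a task at its competence frontier). The name `frontier_reward` and the triangular shape are assumptions for illustration only.

```python
def frontier_reward(num_correct, num_attempts):
    """Hypothetical difficulty reward for one Generator-proposed task.

    num_correct / num_attempts is the Solver's empirical success rate on the task.
    The reward is 1.0 at a 50% success rate and falls linearly to 0.0 when the
    task is always solved (too easy) or never solved (too hard or invalid).
    """
    if num_attempts <= 0:
        raise ValueError("num_attempts must be positive")
    if not 0 <= num_correct <= num_attempts:
        raise ValueError("num_correct must lie between 0 and num_attempts")

    success_rate = num_correct / num_attempts
    return 1.0 - abs(2.0 * success_rate - 1.0)


if __name__ == "__main__":
    # Eight Solver attempts per task, with varying numbers of successes.
    for correct in (0, 2, 4, 8):
        print(correct, frontier_reward(correct, 8))
```

Under this assumed shape, a frozen Generator cannot track the Solver as it improves: tasks that once sat near a 50% success rate drift toward 100% and stop providing a useful training signal, which is consistent with the ablation result above.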

This review was generated by AI and reviewed by a human editor.