QUICK REVIEW

[Paper Review] Dr. Zero: Self-Evolving Search Agents without Training Data

Zhenrui Yue, Kartikeya Upasani|arXiv (Cornell University)|Jan 11, 2026

Topic Modeling|Cited by 0

One-Sentence Summary

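Dr. Zero proposes a data-free proposer-solver framework that uses hop-grouped relative policy optimization (HRPO) and a difficulty-guided reward to train self-evolving open-domain search agents, matching or surpassing supervised baselines.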

ABSTRACT

As high-quality data becomes increasingly difficult to obtain, data-free self-evolution has emerged as a promising paradigm. This approach allows large language models (LLMs) to autonomously generate and solve complex problems, thereby improving their reasoning capabilities. However, multi-turn search agents struggle in data-free self-evolution due to the limited question diversity and the substantial compute required for multi-step reasoning and tool using. In this work, we introduce Dr. Zero, a framework enabling search agents to effectively self-evolve without any training data. In particular, we design a self-evolution feedback loop where a proposer generates diverse questions to train a solver initialized from the same base model. As the solver evolves, it incentivizes the proposer to produce increasingly difficult yet solvable tasks, thus establishing an automated curriculum to refine both agents. To enhance training efficiency, we also introduce hop-grouped relative policy optimization (HRPO). This method clusters structurally similar questions to construct group-level baselines, effectively minimizing the sampling overhead in evaluating each query's individual difficulty and solvability. Consequently, HRPO significantly reduces the compute requirements for solver training without compromising performance or stability. Extensive experiment results demonstrate that the data-free Dr. Zero matches or surpasses fully supervised search agents, proving that complex reasoning and search capabilities can emerge solely through self-evolution.

Research Motivation and Goals

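Study zero-data self-evolution in open-domain question answering, with external retrieval as the only source of supervision.
Jointly train a proposer and a solver so that the proposer generates diverse and challenging multi-hop questions.
Reduce training compute while preserving performance through hop-grouped relative policy optimization.
Show that data-free self-evolution can match or surpass supervised baselines on multiple benchmarks.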

Proposed Method

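A proposer-solver framework in which both agents are initialized from the same base LLM and use an external search engine R.
Hop-grouped relative policy optimization (HRPO), which clusters QA pairs by hop count and computes group-based advantages.
A difficulty-guided reward for the proposer that encourages verifiable, non-trivial questions (solvable but not easy).
Solver training with group relative policy optimization (GRPO), using final prediction accuracy as the main signal.
An alternating optimization loop in which the improving solver pushes the proposer to design harder questions, forming a curriculum.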
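
The hop-grouped advantage and the difficulty-guided reward can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the reward formula (zero when no solver attempt or every solver attempt is correct, otherwise larger when fewer attempts succeed), and the mean/standard-deviation normalization are assumptions chosen only to show the idea of scoring each question against others with the same hop count.

```python
from collections import defaultdict
from statistics import mean, pstdev


def difficulty_reward(num_correct: int, num_attempts: int) -> float:
    """Hypothetical difficulty-guided reward for one proposed question.

    Returns 0 when no solver attempt succeeds (possibly unsolvable) or when
    every attempt succeeds (too easy); otherwise the reward grows as fewer
    attempts succeed. The exact formula is an assumption for illustration.
    """
    if num_attempts < 2:
        raise ValueError("need at least two solver attempts")
    if num_correct == 0 or num_correct == num_attempts:
        return 0.0
    return (num_attempts - num_correct) / (num_attempts - 1)


def hop_grouped_advantages(hops, rewards, eps=1e-6):
    """Normalize each reward against the questions that share its hop count."""
    groups = defaultdict(list)
    for hop, reward in zip(hops, rewards):
        groups[hop].append(reward)
    # One (mean, std) baseline per hop group instead of one per question.
    stats = {hop: (mean(rs), pstdev(rs)) for hop, rs in groups.items()}
    return [
        (reward - stats[hop][0]) / (stats[hop][1] + eps)
        for hop, reward in zip(hops, rewards)
    ]


# Toy batch: hop count of each proposed question and, for each question,
# (correct solver attempts, total solver attempts). Values are made up.
hops = [1, 1, 2, 2]
solver_outcomes = [(4, 4), (2, 4), (1, 4), (3, 4)]
rewards = [difficulty_reward(k, n) for k, n in solver_outcomes]
advantages = hop_grouped_advantages(hops, rewards)
```

Because the baseline is shared across a hop group, each proposed question needs only its own reward rather than a separate batch of samples to estimate a per-question baseline, which is the saving the summary attributes to HRPO.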

Experimental Results

Research Questions

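RQ1: Can data-free self-evolving proposer and solver training match or surpass supervised baselines in open-domain question answering?
RQ2: In a multi-turn, tool-using setting, does HRPO reduce compute cost while maintaining or improving performance?
RQ3: How do the proportion and structure of multi-hop questions affect learning dynamics and final performance?
RQ4: What characterizes the stability and training dynamics of zero-data self-evolution in search agents?
RQ5: Does the framework generalize across base model sizes (3B and 7B)?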

Key Findings

NQ	TriviaQA	PopQA	HotpotQA	2WikiMQA	MuSiQue	Bamboogle	Average
0.397	0.572	0.431	0.298	0.291	0.091	0.200	0.326
0.406	0.608	0.416	0.362	0.347	0.104	0.360	0.372
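
The Average column can be checked directly from the seven benchmark scores. The sketch below recomputes it for both rows and also splits each row into single-hop and multi-hop means. The rows carry no labels in the table, so the code refers to them only by position; treating NQ, TriviaQA, and PopQA as single-hop and the remaining four as multi-hop is an assumption based on how these benchmarks are commonly categorized.

```python
BENCHMARKS = ["NQ", "TriviaQA", "PopQA", "HotpotQA", "2WikiMQA", "MuSiQue", "Bamboogle"]

# Assumed split; the table itself does not mark which benchmarks are single-hop.
SINGLE_HOP = {"NQ", "TriviaQA", "PopQA"}

# (scores in column order, reported Average), copied from the table rows.
ROWS = [
    ([0.397, 0.572, 0.431, 0.298, 0.291, 0.091, 0.200], 0.326),
    ([0.406, 0.608, 0.416, 0.362, 0.347, 0.104, 0.360], 0.372),
]


def summarize(scores):
    """Return the overall, single-hop, and multi-hop means for one row."""
    single = [s for b, s in zip(BENCHMARKS, scores) if b in SINGLE_HOP]
    multi = [s for b, s in zip(BENCHMARKS, scores) if b not in SINGLE_HOP]
    return {
        "average": sum(scores) / len(scores),
        "single_hop": sum(single) / len(single),
        "multi_hop": sum(multi) / len(multi),
    }


for position, (scores, reported) in enumerate(ROWS, start=1):
    summary = summarize(scores)
    print(
        f"row {position}: average={summary['average']:.3f} (reported {reported:.3f}), "
        f"single-hop={summary['single_hop']:.3f}, multi-hop={summary['multi_hop']:.3f}"
    )
```

Both recomputed averages agree with the reported column to three decimals, and under the assumed split the gap between the two rows is larger on the multi-hop benchmarks than on the single-hop ones.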

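Dr. Zero matches or surpasses supervised search agents on multiple benchmarks without any training data.
With both 3B and 7B base models, Dr. Zero achieves strong results on single-hop and multi-hop tasks, including the challenging 2WikiMQA.
HRPO substantially reduces proposer-training compute by avoiding nested sampling, while preserving performance.
Dr. Zero outperforms several data-free baselines (SQLM*, R-Zero*) and approaches or surpasses RL-based supervised baselines.
When the model scales from 3B to 7B, the gains on multi-hop benchmarks are more pronounced, provided the curriculum is appropriately challenging.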


This review was generated by AI and reviewed by a human editor.