QUICK REVIEW

[Paper Review] TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG

Tianhua Zhang, Kun Li | arXiv (Cornell University) | Jan 11, 2026


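One-sentence summary

TreePS-RAG introduces online tree-structured rollouts for process supervision in agentic RAG: step-level utility is estimated by Monte Carlo averaging over descendant outcomes, which requires no intermediate labels and improves both RL training efficiency and QA performance.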

ABSTRACT

Agentic retrieval-augmented generation (RAG) formulates question answering as a multi-step interaction between reasoning and information retrieval, and has recently been advanced by reinforcement learning (RL) with outcome-based supervision. While effective, relying solely on sparse final rewards limits step-wise credit assignment and provides weak guidance for intermediate reasoning and actions. Recent efforts explore process-level supervision, but typically depend on offline constructed training data, which risks distribution shift, or require costly intermediate annotations. We present TreePS-RAG, an online, tree-based RL framework for agentic RAG that enables step-wise credit assignment while retaining standard outcome-only rewards. Our key insight is to model agentic RAG reasoning as a rollout tree, where each reasoning step naturally maps to a node. This tree structure allows step utility to be estimated via Monte Carlo estimation over its descendant outcomes, yielding fine-grained process advantages without requiring intermediate labels. To make this paradigm practical, we introduce an efficient online tree construction strategy that preserves exploration diversity under a constrained computational budget. With a rollout cost comparable to strong baselines like Search-R1, experiments on seven multi-hop and general QA benchmarks across multiple model scales show that TreePS-RAG consistently and significantly outperforms both outcome-supervised and leading process-supervised RL methods.
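
The core idea in the abstract, estimating a step's utility from the outcomes of its descendants, can be illustrated with a minimal sketch. The `Node` class, its field names, and the toy rewards below are hypothetical and chosen only for illustration; the paper's actual data structures and reward definition may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One reasoning step in a rollout tree (hypothetical structure)."""
    step: str
    reward: Optional[float] = None  # set only on leaves: the final outcome reward
    children: List["Node"] = field(default_factory=list)


def descendant_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of every leaf at or below this node."""
    if not node.children:
        return [] if node.reward is None else [node.reward]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(descendant_rewards(child))
    return rewards


def node_value(node: Node) -> float:
    """Monte Carlo estimate of V(n): mean outcome reward over descendant leaves."""
    rewards = descendant_rewards(node)
    return sum(rewards) / len(rewards) if rewards else 0.0


# Toy tree: two search steps under the question, three final answers in total.
root = Node("question", children=[
    Node("search A", children=[
        Node("answer 1", reward=1.0),
        Node("answer 2", reward=0.0),
    ]),
    Node("search B", children=[
        Node("answer 3", reward=0.0),
    ]),
])

values = {child.step: node_value(child) for child in root.children}
root_value = node_value(root)
print(values)      # {'search A': 0.5, 'search B': 0.0}
print(root_value)  # one correct answer out of three leaves
```

In this toy tree, "search A" receives a higher value than "search B" purely from final-answer rewards, which is the sense in which no intermediate labels are needed.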

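Motivation and objectives

Improve credit assignment in agentic RAG beyond sparse final rewards.
Propose online tree-structured rollouts that enable step-wise supervision without intermediate labels.
Develop efficient online tree construction with a diversity-preserving pruning strategy.
Demonstrate improvements over outcome-only and existing process-supervised RL baselines across QA benchmarks.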

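Proposed method

Model an agentic RAG rollout as a rooted tree in which each step is a node and each leaf corresponds to a final answer.
Assign each internal node a process value V(n) by Monte Carlo estimation over its descendant leaves, then compute process advantages from these values.
Compute global and local advantages from the node values and combine them into a normalized process advantage A(n) for policy optimization.
Build the tree online with a depth limit, using a budget-aware branching factor B_d = ceil(N / |M(d-1)|) to control compute.
Apply similarity-based pruning to sibling search nodes, using the Jaccard similarity of their top-K retrieved passages, to keep the continuations diverse.
In the policy-gradient update, broadcast each node-level process advantage to all tokens generated within that step.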
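
The tree-construction steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: N is read here as a per-depth rollout budget and M(d-1) as the set of nodes retained at depth d-1; the greedy keep-first pruning rule, the `threshold` value, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Set


def branching_factor(total_budget: int, num_parents: int) -> int:
    """B_d = ceil(N / |M(d-1)|): children to sample per retained parent."""
    if num_parents <= 0:
        raise ValueError("need at least one parent node")
    return math.ceil(total_budget / num_parents)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: intersection size over union size (1.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def prune_similar_siblings(
    sibling_passages: Sequence[Sequence[str]],
    top_k: int = 3,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[int]:
    """Greedily keep siblings whose top-K passage sets differ from all kept ones.

    Returns the indices of the retained siblings.
    """
    kept: List[int] = []
    kept_sets: List[Set[str]] = []
    for idx, passages in enumerate(sibling_passages):
        current = set(passages[:top_k])
        if all(jaccard(current, seen) < threshold for seen in kept_sets):
            kept.append(idx)
            kept_sets.append(current)
    return kept


def broadcast_advantage(advantage: float, num_tokens: int) -> List[float]:
    """Give every token generated in a step the same node-level advantage."""
    return [advantage] * num_tokens


# Three sibling search nodes, each described by its retrieved passage IDs.
siblings = [
    ["d1", "d2", "d3"],
    ["d1", "d2", "d4"],  # overlaps with the first: Jaccard = 2 / 4 = 0.5
    ["d7", "d8", "d9"],
]
print(branching_factor(8, 3))            # 3
print(prune_similar_siblings(siblings))  # [0, 2]
print(broadcast_advantage(0.25, 4))      # [0.25, 0.25, 0.25, 0.25]
```

Dividing the budget by the number of retained parents keeps the total number of children per depth close to N, and pruning near-duplicate siblings means that budget is spent on continuations that saw different evidence.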

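Experimental results

Research questions

RQ1: Can process supervision improve learning in agentic RAG beyond final outcome rewards, without explicit step-level annotations?
RQ2: Can online tree rollouts provide dense credit assignment under a rollout budget comparable to conventional outcome-based RL?
RQ3: Do similarity-based pruning and Monte Carlo-derived process values yield better exploration and learning signals than standard approaches?
RQ4: How does TreePS-RAG compare with outcome-supervised and other process-supervised RL methods across multiple QA benchmarks and model scales?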

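Key findings

TreePS-RAG consistently outperforms competing baselines on seven QA benchmarks across four backbone models.
Online rollout cost remains comparable to outcome-based methods such as Search-R1.
Process advantages derived from tree-structured supervision provide finer-grained credit assignment and improve performance even without intermediate labels.
Similarity-based pruning is critical for maintaining exploration diversity and achieving robust gains.
Expanding the tree structure yields modest additional gains by reducing the variance of the Monte Carlo estimates.
A continuation analysis shows that TreePS-RAG is better than the baselines at correcting imperfect reasoning prefixes.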
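
As a rough intuition for the variance point, assume (a simplification not taken from the paper) that the k leaf outcomes below a node n are independent binary rewards r_i, each correct with probability p. The Monte Carlo value estimate and its variance are then:

```latex
\hat{V}(n) = \frac{1}{k}\sum_{i=1}^{k} r_i,
\qquad
\operatorname{Var}\!\big[\hat{V}(n)\big] = \frac{p\,(1-p)}{k}
```

Under this assumption, doubling the number of descendant leaves halves the variance of the estimate. In an actual rollout tree the leaves share prefixes and are therefore correlated, so the reduction would be slower than 1/k.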


This review was generated by AI and reviewed by human editors.