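[Paper Explained] Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling
BranPO introduces contrastive tail-branch sampling to improve credit assignment in long-horizon tasks. It combines tail-focused branching, difficulty-aware sampling, and redundant step masking to achieve strong multi-hop QA performance without any additional training budget.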
Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, we identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions. Extensive experiments on various question answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget. Our code is available at https://github.com/YubaoZhao/BranPO.
Motivation and Objectives
- Identify why late-stage decisions drive errors in long-horizon agentic search tasks.
- Develop a value-free branching method that provides step-level contrastive supervision without dense rewards.
- Improve training efficiency and stability via adaptive branching and masking of redundant steps.
- Demonstrate BranPO’s effectiveness on diverse multi-hop and web-search QA benchmarks.
Proposed Method
- Propose Branching Relative Policy Optimization (BranPO), a value-free policy objective that distributes credit across shared prefixes and branched suffixes.
- Truncate trajectories at the tail and resample suffixes to create contrastive branches that differ in outcome (correct vs incorrect continuations).
- Compute base advantages for shared prefixes by averaging branch rewards, and compute branch advantages for suffixes using normalized, group-wise statistics (GRPO-inspired).
- Introduce difficulty-aware branch sampling to allocate more branching budget to hard tasks or incorrect trajectories.
- Apply Redundant Step Masking (RSM) to suppress gradient signals from redundant late steps, reducing continuation bias.
- Provide theoretical connections showing BranPO combines stable GRPO gradients with Direct Preference Optimization (DPO)-like suffix updates.
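The two-level advantage computation in the list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `normalize` and `branpo_advantages`, the constant `EPS`, and the choice to normalize prefix rewards across the rollout group and suffix rewards across sibling branches are all assumptions made for this example.

```python
from statistics import mean, pstdev

EPS = 1e-6  # assumed small constant to avoid division by zero


def normalize(values):
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def branpo_advantages(branch_rewards):
    """Hypothetical helper: branch_rewards[i][k] is the outcome reward of
    suffix k resampled from the shared prefix of rollout i.

    Returns (base_adv, branch_adv):
      base_adv[i]      advantage for the shared-prefix steps of rollout i
      branch_adv[i][k] advantage for the steps of suffix k of rollout i
    """
    # A prefix is scored by the average reward of the branches that share it.
    prefix_rewards = [mean(r) for r in branch_rewards]
    # Assumption: prefixes are compared across the rollout group.
    base_adv = normalize(prefix_rewards)
    # Assumption: suffixes are compared only against their sibling branches.
    branch_adv = [normalize(r) for r in branch_rewards]
    return base_adv, branch_adv


# Three rollouts for one question, two branches each (1.0 = correct answer).
rewards = [[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
base_adv, branch_adv = branpo_advantages(rewards)
```

In this toy group only the first rollout has branches that disagree, so it is the only one whose suffixes receive a non-zero contrastive signal. The other two rollouts contribute through their prefix advantage alone.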

Experimental Results
Research Questions
- RQ1: Can tail-focused, contrastive branching provide more informative supervision than uniform trajectory-level signals in long-horizon tasks?
- RQ2: How can branching frequency be adapted to task difficulty to improve sample efficiency without increasing total training budget?
- RQ3: Does masking redundant tail steps improve stability and efficiency of learning in long-horizon agentic search?
- RQ4: Do BranPO variants improve performance on multi-hop QA benchmarks and real-world web search tasks compared to strong baselines?
Key Findings
- BranPO consistently outperforms strong baselines on multi-hop QA benchmarks, including improvements over GRPO, Tree-GRPO, and GiGPO.
- Branching from the trajectory tail with contrastive suffixes yields better credit assignment for late-stage decisions, improving learning signals.
- Difficulty-aware branch sampling concentrates computation on informative, hard instances and maintains efficiency.
- Redundant Step Masking reduces continuation bias by masking uninformative tail steps, stabilizing training.
- BranPO scales to longer horizons and generalizes to web-search scenarios, outperforming GRPO on GAIA.
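To make the difficulty-aware branch sampling finding concrete, here is one way a fixed branching budget could be shifted toward hard tasks. This is a hedged sketch only: the function `allocate_branches`, the use of `1 - success_rate` as the difficulty score, and the proportional split are assumptions for illustration, and the source does not specify the actual allocation rule.

```python
def allocate_branches(success_rates, total_branches):
    """Hypothetical helper: split a fixed branching budget across tasks.

    success_rates[i] is the fraction of initial rollouts for task i that
    ended with a correct answer. Difficulty is assumed to be 1 - success rate.
    """
    difficulty = [1.0 - s for s in success_rates]
    total = sum(difficulty)
    if total == 0:
        # Every task is already solved; no branch is worth spending.
        return [0] * len(success_rates)

    # Ideal (fractional) share of the budget for each task.
    raw = [total_branches * d / total for d in difficulty]
    alloc = [int(x) for x in raw]  # round down first

    # Hand leftover branches to the tasks with the largest remainders,
    # so the total never exceeds the fixed budget.
    leftover = total_branches - sum(alloc)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Task 0 is never solved, task 1 half the time, task 2 always.
budget = allocate_branches([0.0, 0.5, 1.0], total_branches=6)
```

Because the total is fixed up front, harder tasks gain branches only at the expense of easier ones, which matches the claim that the overall training budget does not grow.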

This explainer was generated by AI and reviewed by human editors.