QUICK REVIEW

[Paper Review] Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling

Yubao Zhao, Weiquan Huang|arXiv (Cornell University)|Feb 3, 2026
Topic Modeling | 0 citations
TL;DR

BranPO introduces contrastive dynamic branch sampling to improve credit assignment in long-horizon agentic RL, using tail-focused branching, difficulty-aware sampling, and redundant step masking, achieving strong multi-hop QA performance without extra training budget.

ABSTRACT

Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, we identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions. Extensive experiments on various question answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget. Our code is available at https://github.com/YubaoZhao/BranPO.

Motivation & Objective

  • Identify why late-stage decisions drive errors in long-horizon agentic search tasks.
  • Develop a value-free branching method that provides step-level contrastive supervision without dense rewards.
  • Improve training efficiency and stability via adaptive branching and masking of redundant steps.
  • Demonstrate BranPO’s effectiveness on diverse multi-hop and web-search QA benchmarks.

Proposed method

  • Propose Branching Relative Policy Optimization (BranPO), a value-free policy objective that distributes credit across shared prefixes and branched suffixes.
  • Truncate trajectories at the tail and resample suffixes to create contrastive branches that differ in outcome (correct vs incorrect continuations).
  • Compute base advantages for shared prefixes by averaging branch rewards and branch advantages for suffixes using normalized, group-wise statistics (GRPO-inspired).
  • Introduce difficulty-aware branch sampling to allocate more branching budget to hard tasks or incorrect trajectories.
  • Apply Redundant Step Masking (RSM) to suppress gradient signals from redundant late steps, reducing continuation bias.
  • Provide theoretical connections showing BranPO combines stable GRPO gradients with Direct Preference Optimization (DPO)-like suffix updates.
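
The truncate-and-resample step in the list above can be illustrated with a minimal sketch. It is not the authors' implementation: `branch_from_tail`, `contrastive_pairs`, `tail_window`, `n_branches`, and the caller-supplied `resample_suffix` and `reward_fn` are hypothetical names, and choosing the cut point uniformly within the last few steps is an assumption, since the review does not say how the truncation point is selected.

```python
import random


def branch_from_tail(trajectory, resample_suffix, tail_window=2, n_branches=3, rng=None):
    """Cut a finished rollout near its end and resample alternative continuations.

    trajectory:      non-empty list of steps from one rollout
    resample_suffix: hypothetical callable, prefix -> new list of steps
    tail_window:     assumed; the cut falls within the last `tail_window` steps
    """
    rng = rng or random.Random(0)
    lowest_cut = max(0, len(trajectory) - tail_window)
    cut = rng.randint(lowest_cut, len(trajectory) - 1)  # both bounds inclusive
    prefix = trajectory[:cut]
    # Keep the original continuation as branch 0, then add fresh samples.
    suffixes = [trajectory[cut:]]
    suffixes += [resample_suffix(prefix) for _ in range(n_branches)]
    return prefix, suffixes


def contrastive_pairs(suffixes, reward_fn):
    """Pair every rewarded suffix with every unrewarded one over the same prefix."""
    scored = [(s, reward_fn(s)) for s in suffixes]
    winners = [s for s, r in scored if r > 0]
    losers = [s for s, r in scored if r <= 0]
    return [(w, l) for w in winners for l in losers]


# Toy usage with a stand-in "policy" that always answers correctly.
rollout = ["search: A", "search: B", "answer: wrong"]
prefix, suffixes = branch_from_tail(rollout, lambda p: ["answer: right"])
pairs = contrastive_pairs(suffixes, lambda s: float(s[-1] == "answer: right"))
```

If every branch receives the same reward, `contrastive_pairs` returns an empty list, so in this sketch such a prefix contributes no contrastive signal.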
Figure 1 : Comparison between GRPO, tree-based GRPO, and BranPO. Yellow nodes denote intermediate steps; green and red nodes indicate correct and incorrect answers. GRPO samples from the trajectory start, which is inefficient because SFT-trained models tend to produce highly similar prefixes. Tree-b

Experimental results

Research questions

  • RQ1: Can tail-focused, contrastive branching provide more informative supervision than uniform trajectory-level signals in long-horizon tasks?
  • RQ2: How can branching frequency be adapted to task difficulty to improve sample efficiency without increasing total training budget?
  • RQ3: Does masking redundant tail steps improve stability and efficiency of learning in long-horizon agentic search?
  • RQ4: Do BranPO variants improve performance on multi-hop QA benchmarks and real-world web search tasks compared to strong baselines?

Key findings

  • BranPO consistently outperforms strong baselines on multi-hop QA benchmarks, including improvements over GRPO, Tree-GRPO, and GiGPO.
  • Branching from the trajectory tail with contrastive suffixes yields better credit assignment for late-stage decisions, improving learning signals.
  • Difficulty-aware branch sampling concentrates computation on informative, hard instances and maintains efficiency.
  • Redundant Step Masking reduces continuation bias by masking uninformative tail steps, stabilizing training.
  • BranPO scales to longer horizons and generalizes to web-search scenarios, outperforming GRPO on GAIA.
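
One possible reading of the credit assignment behind these findings (base advantages for shared prefixes, branch advantages for suffixes, and masking of redundant steps) is sketched below. The exact formulas are assumptions, not taken from the paper: the sketch averages branch rewards per prefix and normalizes across the group, normalizes branch rewards within each prefix, gives suffix steps only the branch advantage, and zeroes steps that a caller has already flagged as redundant. All function names and `EPS` are hypothetical.

```python
from statistics import fmean, pstdev

EPS = 1e-6  # assumed numerical stabilizer, not from the paper


def normalize(values):
    """GRPO-style group normalization: (x - mean) / (std + eps)."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation of the group
    return [(v - mu) / (sigma + EPS) for v in values]


def prefix_and_branch_advantages(branch_rewards):
    """branch_rewards[i][k] is the outcome reward of branch k grown from prefix i.

    Returns (base_adv, branch_adv):
      base_adv[i]      advantage shared by every step of prefix i
      branch_adv[i][k] advantage for the steps of suffix k under prefix i
    """
    prefix_scores = [fmean(r) for r in branch_rewards]  # average over branches
    base_adv = normalize(prefix_scores)                 # compare prefixes in the group
    branch_adv = [normalize(r) for r in branch_rewards]  # compare sibling suffixes
    return base_adv, branch_adv


def step_advantages(prefix_len, suffix_len, base, branch, redundant=()):
    """Expand to per-step advantages, zeroing steps flagged as redundant.

    `redundant` holds step indices; how steps get flagged is left to the caller.
    """
    adv = [base] * prefix_len + [branch] * suffix_len
    return [0.0 if t in redundant else a for t, a in enumerate(adv)]


# Three prefixes for one question, two branches each (1.0 = correct answer).
base_adv, branch_adv = prefix_and_branch_advantages(
    [[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
)
per_step = step_advantages(2, 2, base=0.5, branch=-1.0, redundant={3})
```

In this toy group, the prefix whose branches disagree gets a base advantage of zero but non-zero branch advantages, so its learning signal comes entirely from the contrast between its suffixes.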
Figure 2 : Overview of BranPO. After the initial rollout, group accuracy is computed and branching budgets are assigned based on task accuracy and trajectory reward. Simple branching is applied to correct trajectories in easy tasks, while recursive branching is used for hard tasks or incorrect traje
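
The budget assignment described in this caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's procedure: `assign_branching`, the `easy_threshold` cut-off, and the two budget sizes are hypothetical, and the sketch does not enforce the fixed overall training budget that the paper reports.

```python
def assign_branching(rewards, easy_threshold=0.5, simple_budget=1, recursive_budget=3):
    """Choose a branching mode and budget for each initial rollout of one task.

    rewards: outcome rewards of the initial rollouts (1.0 correct, 0.0 incorrect)
    easy_threshold, simple_budget, recursive_budget: assumed values for illustration
    """
    group_accuracy = sum(rewards) / len(rewards)
    task_is_easy = group_accuracy >= easy_threshold
    plan = []
    for reward in rewards:
        if task_is_easy and reward > 0:
            # Correct trajectory on an easy task: light-weight branching.
            plan.append(("simple", simple_budget))
        else:
            # Hard task or incorrect trajectory: spend more branching budget.
            plan.append(("recursive", recursive_budget))
    return group_accuracy, plan


easy_acc, easy_plan = assign_branching([1.0, 1.0, 1.0, 0.0])
hard_acc, hard_plan = assign_branching([0.0, 0.0, 1.0, 0.0])
```

On the easy task only the single incorrect rollout receives the larger budget, while on the hard task every rollout does, including the one that happened to be correct.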

This review was created by AI and reviewed by human editors.