QUICK REVIEW

[Paper Review] Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling

Yubao Zhao, Weiquan Huang|arXiv (Cornell University)|Feb 3, 2026

Topic Modeling|Citations: 0

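ONE-LINE SUMMARY

BranPO introduces contrastive dynamic branch sampling, combining tail-focused branching, difficulty-aware sampling, and redundant step masking to improve credit assignment in long-horizon agentic reinforcement learning. It improves multi-hop QA performance without additional training budget.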

ABSTRACT

Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, we identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions. Extensive experiments on various question answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget. Our code is available at https://github.com/YubaoZhao/BranPO.

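MOTIVATION AND OBJECTIVES

Identify why late-stage decisions produce errors in long-horizon agentic search tasks.
Develop a value-free branching method that provides step-level contrastive supervision without dense rewards.
Improve training efficiency and stability through adaptive branching and redundant step masking.
Demonstrate BranPO's effectiveness across diverse multi-hop and web-search QA benchmarks.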

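PROPOSED METHOD

Proposes Branching Relative Policy Optimization (BranPO), a value-free policy objective that splits credit between a shared prefix and its branched suffixes.
Truncates trajectories near the tail and resamples suffixes to create contrastive branches whose continuations differ in outcome (correct vs. incorrect).
Computes a base advantage for the shared prefix by averaging branch rewards, and a branch advantage for each suffix from normalized group statistics (GRPO-style).
Introduces difficulty-aware branch sampling, which allocates more branching budget to harder tasks and incorrect trajectories.
Applies redundant step masking (RSM) to suppress gradient signal from redundant late steps and reduce continuation bias.
Provides a theoretical connection showing that BranPO combines stable GRPO gradients with Direct Preference Optimization (DPO)-style suffix updates.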
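
The split between a prefix-level base advantage and a suffix-level branch advantage can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `eps` constant, and the choice to normalize the averaged prefix rewards across the rollout group of one question are all assumptions made for illustration. Only the idea of averaging branch rewards for the shared prefix and using GRPO-style group normalization for the suffixes comes from the summary above.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-6):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def branch_advantages(branch_rewards_per_prefix, eps=1e-6):
    """Hypothetical helper, not from the paper's code.

    branch_rewards_per_prefix[i] lists the outcome rewards of every suffix
    that continues prefix i (the original continuation plus resampled ones).
    """
    # Base signal for each shared prefix: the average reward of its branches.
    prefix_values = [mean(rs) for rs in branch_rewards_per_prefix]
    # Assumption: prefixes of the same question are compared against each other.
    base_adv = group_normalize(prefix_values, eps)
    # Branch advantage: each suffix is compared only with its sibling suffixes.
    suffix_adv = [group_normalize(rs, eps) for rs in branch_rewards_per_prefix]
    return base_adv, suffix_adv


# Two prefixes for one question: the first has one correct and one incorrect
# branch, the second has two incorrect branches.
base_adv, suffix_adv = branch_advantages([[1.0, 0.0], [0.0, 0.0]])
```

In this toy case the first prefix receives a positive base advantage and its two suffixes receive opposite-signed branch advantages, while the second prefix, whose branches all fail, contributes no contrastive suffix signal.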

Figure 1 : Comparison between GRPO, tree-based GRPO, and BranPO. Yellow nodes denote intermediate steps; green and red nodes indicate correct and incorrect answers. GRPO samples from the trajectory start, which is inefficient because SFT-trained models tend to produce highly similar prefixes. Tree-b

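EXPERIMENTAL RESULTS

Research Questions

RQ1: Can tail-focused contrastive branching provide more informative supervision than uniform trajectory-level signals in long-horizon tasks?
RQ2: Can adapting branching frequency to task difficulty improve sample efficiency without increasing the total training budget?
RQ3: Does masking redundant tail steps improve training stability and efficiency in long-horizon agentic search?
RQ4: Do BranPO variants improve performance over strong baselines on multi-hop QA benchmarks and real-world web search tasks?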

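Key Findings

BranPO consistently outperforms strong baselines, including GRPO, Tree-GRPO, and GiGPO, on multi-hop QA benchmarks.
Branching from the trajectory tail with contrastive suffixes improves credit assignment for late-stage decisions and strengthens the learning signal.
Difficulty-aware branch sampling concentrates compute on informative, harder examples while preserving efficiency.
Redundant step masking suppresses continuation bias and stabilizes training.
BranPO scales to longer horizons and generalizes to web search scenarios, outperforming GRPO on GAIA.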

Figure 2 : Overview of BranPO. After the initial rollout, group accuracy is computed and branching budgets are assigned based on task accuracy and trajectory reward. Simple branching is applied to correct trajectories in easy tasks, while recursive branching is used for hard tasks or incorrect traje
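
The budget assignment outlined in the Figure 2 caption can be sketched roughly as follows. The threshold `easy_threshold`, the budget sizes, and the function name are hypothetical values chosen for illustration; the review does not state the actual rule or numbers. The sketch only mirrors the stated idea: compute group accuracy after the initial rollout, use simple branching for correct trajectories in easy tasks, and use recursive branching otherwise.

```python
def assign_branching(rewards, easy_threshold=0.5, simple_budget=1, recursive_budget=3):
    """Hypothetical budget rule, not taken from the paper.

    rewards: outcome rewards of the initial rollouts for one task
             (a reward > 0 is treated as a correct trajectory).
    Returns the group accuracy and one (mode, budget) plan per trajectory.
    """
    group_acc = sum(r > 0 for r in rewards) / len(rewards)
    task_is_easy = group_acc >= easy_threshold  # assumed definition of "easy"
    plans = []
    for r in rewards:
        if task_is_easy and r > 0:
            # Correct trajectory in an easy task: a small, single-level branch.
            plans.append(("simple", simple_budget))
        else:
            # Hard task or incorrect trajectory: spend more branching budget.
            plans.append(("recursive", recursive_budget))
    return group_acc, plans


easy_acc, easy_plans = assign_branching([1.0, 1.0, 1.0, 0.0])
hard_acc, hard_plans = assign_branching([0.0, 0.0, 1.0, 0.0])
```

With these assumed settings, the mostly solved task branches lightly on its correct rollouts, while every rollout of the mostly failed task is marked for recursive branching.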

This review was generated by AI and checked by a human editor.