QUICK REVIEW

[Paper Review] Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling

Yubao Zhao, Weiquan Huang|arXiv (Cornell University)|2026. 02. 03.

Topic Modeling|Citations: 0

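One-Line Summary

BranPO improves credit assignment in long-horizon agentic RL through tail-focused contrastive branch sampling, combined with difficulty-aware sampling and redundant step masking, and achieves strong multi-hop QA performance without any additional training budget.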

ABSTRACT

Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, we identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions. Extensive experiments on various question answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget. Our code is available at https://github.com/YubaoZhao/BranPO.

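Motivation and Objectives

Identify why decisions at the tail steps drive errors in long-horizon agentic search tasks.
Develop a value-free branching method that provides step-level contrastive supervision without dense rewards.
Improve training efficiency and stability through adaptive branching and masking of redundant steps.
Demonstrate BranPO's effectiveness on a range of multi-hop and web-search QA benchmarks.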

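Proposed Method

Proposes Branching Relative Policy Optimization (BranPO), a value-free policy objective that distributes credit between a shared prefix and its branched suffixes.
Truncates trajectories near the tail and resamples suffixes to create contrastive branches whose outcomes differ (correct vs. incorrect continuations).
Averages branch rewards over the shared prefix and computes suffix advantages from normalized group-wise statistics (GRPO-inspired).
Introduces difficulty-aware branch sampling to allocate more branching budget to hard tasks and incorrect trajectories.
Applies Redundant Step Masking (RSM) to suppress the gradient signal from redundant tail steps, reducing continuity bias.
Provides a theoretical connection showing that BranPO combines stable GRPO gradients with Direct Preference Optimization (DPO)-like suffix updates.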
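
The credit split between a shared prefix and its resampled suffixes can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `group_normalize` and `branch_credit`, the use of population standard deviation, and the `eps` stabilizer are all assumptions made for illustration. It only shows the idea summarized here, where the prefix receives the mean branch reward and each suffix receives a group-normalized advantage.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-6):
    """GRPO-style normalization: subtract the group mean, divide by the group std.

    eps is a hypothetical stabilizer so identical rewards yield zeros, not a crash.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def branch_credit(branch_rewards, eps=1e-6):
    """Split credit for one truncation point (illustrative only).

    branch_rewards: outcome rewards of the suffixes resampled from one shared prefix.
    Returns (prefix_reward, suffix_advantages).
    """
    # The shared prefix is scored by the average outcome of its branches.
    prefix_reward = mean(branch_rewards)
    # Each suffix is scored relative to its sibling branches.
    suffix_advantages = group_normalize(branch_rewards, eps)
    return prefix_reward, suffix_advantages


# Four suffixes resampled from one prefix: two correct (1.0), two incorrect (0.0).
prefix_reward, suffix_adv = branch_credit([1.0, 0.0, 0.0, 1.0])
```

In this toy case the correct suffixes get positive advantages and the incorrect ones negative, while a prefix whose branches all end the same way produces zero suffix advantage and therefore no contrastive signal.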

Figure 1 : Comparison between GRPO, tree-based GRPO, and BranPO. Yellow nodes denote intermediate steps; green and red nodes indicate correct and incorrect answers. GRPO samples from the trajectory start, which is inefficient because SFT-trained models tend to produce highly similar prefixes. Tree-b

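Experimental Results

Research Questions

RQ1: Can tail-focused contrastive branching provide more informative supervision than uniform trajectory-level signals in long-horizon tasks?
RQ2: How can branching frequency be adapted to task difficulty to improve sample efficiency without increasing the total training budget?
RQ3: Does masking redundant tail steps improve training stability and efficiency in long-horizon agentic search?
RQ4: Do BranPO variants improve performance over strong baselines on multi-hop QA benchmarks and real-world web search tasks?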

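Key Results

BranPO consistently outperforms strong baselines, including GRPO, Tree-GRPO, and GiGPO, on multi-hop QA benchmarks.
Branching with contrastive suffixes at the trajectory tail gives better credit assignment for tail decisions and a stronger learning signal.
Difficulty-aware branch sampling concentrates compute on informative, hard cases while preserving efficiency.
Redundant step masking reduces continuity bias and stabilizes training by masking uninformative tail steps.
BranPO scales to longer horizons, generalizes to web search scenarios, and outperforms GRPO on GAIA.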
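
To make the masking idea concrete, here is a rough sketch of how a per-step loss mask might be applied. The review does not say how BranPO decides that a step is redundant, so the rule below (a step is redundant if its action text repeats an earlier action in the same trajectory, ignoring case and whitespace) is purely an assumption, as are the names `redundant_step_mask` and `masked_mean`.

```python
def redundant_step_mask(actions):
    """Return 1.0 for steps to keep and 0.0 for steps to mask.

    Assumed rule (not from the paper): a step is redundant when its
    normalized action string already appeared earlier in the trajectory.
    """
    seen = set()
    mask = []
    for action in actions:
        key = " ".join(action.lower().split())  # normalize case and whitespace
        mask.append(0.0 if key in seen else 1.0)
        seen.add(key)
    return mask


def masked_mean(step_losses, mask):
    """Average per-step losses over unmasked steps only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(step_losses, mask)) / kept


actions = [
    "search: capital of France",
    "Search:  capital of france",  # repeats the first query
    "answer: Paris",
]
mask = redundant_step_mask(actions)
loss = masked_mean([2.0, 10.0, 4.0], mask)
```

Masked steps contribute nothing to the averaged loss, so a repeated query neither rewards nor penalizes the policy in this sketch.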

Figure 2 : Overview of BranPO. After the initial rollout, group accuracy is computed and branching budgets are assigned based on task accuracy and trajectory reward. Simple branching is applied to correct trajectories in easy tasks, while recursive branching is used for hard tasks or incorrect traje
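
The branching rule described in the Figure 2 caption can be sketched as a small decision function. The caption only states that simple branching goes to correct trajectories in easy tasks and recursive branching to hard tasks or incorrect ones. The threshold `easy_threshold`, the convention that a positive reward means a correct trajectory, and the name `choose_branching` are hypothetical choices for illustration.

```python
def choose_branching(group_rewards, trajectory_reward, easy_threshold=0.5):
    """Pick a branching mode for one trajectory (illustrative only).

    group_rewards: outcome rewards of the initial rollout group for the task.
    trajectory_reward: outcome reward of the trajectory being branched.
    easy_threshold: hypothetical cutoff on group accuracy for an "easy" task.
    """
    # Group accuracy: fraction of initial rollouts that were correct.
    group_accuracy = sum(1 for r in group_rewards if r > 0) / len(group_rewards)
    is_easy = group_accuracy >= easy_threshold
    is_correct = trajectory_reward > 0

    if is_easy and is_correct:
        return "simple"
    # Hard tasks or incorrect trajectories receive the larger budget.
    return "recursive"


easy_correct = choose_branching([1, 1, 1, 0], trajectory_reward=1)
easy_incorrect = choose_branching([1, 1, 1, 0], trajectory_reward=0)
hard_correct = choose_branching([0, 0, 0, 1], trajectory_reward=1)
```

Routing by both task accuracy and trajectory outcome is what lets extra branching go to the cases most likely to yield contrastive pairs while the total budget stays fixed.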


This review was generated by AI and reviewed by a human editor.