QUICK REVIEW

[Paper Review] TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training

Jinluan Yang, Yuxin Liu | arXiv (Cornell University) | 2026-03-02

Reinforcement Learning in Robotics | Citations: 0

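One-Line Summary

TopoCurate introduces data curation based on interaction topology to improve tool-use agent training. It applies topology-guided data selection to both SFT and RL and reports consistent gains over baselines on Tau2 Bench and BFCLv3.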

ABSTRACT

Training tool-use agents typically relies on outcome-based filtering: Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. However, this paradigm ignores interaction dynamics: successful trajectories may lack error recovery or exhibit redundancy, while pass rates fail to distinguish structurally informative tasks from trivial ones. We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology. By merging equivalent action-observation states, this projection transforms scattered linear trajectories into a structured manifold that explicitly captures how tool invocations and environmental responses drive the divergence between effective strategies and failure modes. Leveraging this representation, we introduce a dual-selection mechanism: for SFT, we prioritize trajectories demonstrating reflective recovery, semantic efficiency, and strategic diversity to mitigate covariate shift and mode collapse; for RL, we select tasks with high error branch ratios and strategic heterogeneity, maximizing gradient Signal-to-Noise Ratio to address vanishing signals in sparse-reward settings. Evaluations on BFCLv3 and Tau2 Bench show that TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines. We will release the code and data soon for further investigations.
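The merging step described in the abstract can be sketched in Python. This is a minimal illustration, not the authors' implementation: the names `canonical_key` and `build_quotient_graph`, the string-normalization equivalence rule, and the toy rollouts are all assumptions, since the abstract does not say how semantic equivalence is decided.

```python
from collections import defaultdict


def canonical_key(action, observation):
    """Hypothetical equivalence rule: two turns are 'equivalent' when their
    action and observation match after lowercasing and collapsing whitespace.
    The paper's actual semantic test is not given in this review."""
    def norm(text):
        return " ".join(text.lower().split())
    return (norm(action), norm(observation))


def build_quotient_graph(rollouts):
    """Merge turns from several rollouts of one task into a single graph.

    rollouts: list of (turns, success), where turns is a list of
    (action, observation) string pairs and success is a bool.
    Returns (edges, outcomes):
      edges    maps node -> set of successor nodes
      outcomes maps node -> [n_success, n_failure] over rollouts visiting it
    """
    edges = defaultdict(set)
    outcomes = defaultdict(lambda: [0, 0])
    for turns, success in rollouts:
        prev = "START"
        visited = {prev}
        for action, observation in turns:
            node = canonical_key(action, observation)
            edges[prev].add(node)      # equivalent turns share one node
            visited.add(node)
            prev = node
        for node in visited:           # count each rollout once per node
            outcomes[node][0 if success else 1] += 1
    return dict(edges), dict(outcomes)


# Toy data (assumed): two rollouts that agree on the first turn, then diverge.
rollouts = [
    ([("get_user(id=1)", "ok"), ("cancel_order(7)", "done")], True),
    ([("Get_User(id=1)", "OK"), ("cancel_order(9)", "error: not found")], False),
]
edges, outcomes = build_quotient_graph(rollouts)
shared = canonical_key("get_user(id=1)", "ok")
print(len(edges[shared]), outcomes[shared])
```

In this toy case the shared first turn becomes a single node with two outgoing branches, one reached by a successful rollout and one by a failed rollout. That branching is the kind of structure a linear list of trajectories hides. The sketch does not enforce acyclicity, so a rollout that revisits a state would create a cycle.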

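Motivation and Objectives

Shift data curation from outcome-based filtering to modeling the topology of agent-environment interaction.
Develop three SFT-oriented topological metrics (Reflective Recovery, Semantic Efficiency, Distributional Diversity) to reduce covariate shift and mode collapse.
Develop two RL-oriented task-selection metrics (Error Branch Ratio, Strategic Heterogeneity) to maximize the gradient signal-to-noise ratio.
Provide a formal framework that projects multi-trial rollouts onto a quotient topology, and demonstrate its effectiveness empirically.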

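Proposed Method

Define a complete interaction turn as $\hat{z}_t = (r_t, a_t, o_t)$ and merge semantically equivalent turns under a quotient topology to obtain a DAG over states.
Introduce three SFT trajectory scores (Reflective Recovery, Semantic Efficiency, Distributional Diversity) and compute a composite selection weight $w(\tau)$ to reweight the data (Eq. 7).
Formulate RL task selection with structural metrics: use Error Branch Ratio and Strategic Heterogeneity to define a selection distribution that prioritizes high-SNR tasks (Eq. 11).
Provide theoretical links between topological reweighting and (i) minimized KL divergence and reduced covariate shift in SFT, and (ii) maximized Fisher information in GRPO-based reinforcement learning.
Evaluate on Tau2 Bench and BFCL v3 against outcome-based baselines, showing overall improvements in Pass@k and generalization.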
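The two selection steps above can be sketched as follows. This review does not reproduce Eq. 7 or Eq. 11, so the functional forms here are assumptions chosen only for illustration: a normalized linear combination for the SFT weight and a softmax for the RL task distribution. The names `sft_weights`, `task_selection_probs`, `coeffs`, and `temperature` are hypothetical, and all scores are assumed to be non-negative numbers computed elsewhere.

```python
import math


def sft_weights(scores, coeffs=(1.0, 1.0, 1.0)):
    """Composite trajectory weights (assumed form, not the paper's Eq. 7).

    scores: list of (recovery, efficiency, diversity) tuples, one per
    trajectory, each entry assumed non-negative.
    Returns weights that sum to 1.
    """
    raw = [sum(c * s for c, s in zip(coeffs, triple)) for triple in scores]
    total = sum(raw)
    if total <= 0:                       # degenerate case: fall back to uniform
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def task_selection_probs(tasks, temperature=1.0):
    """Task sampling distribution (assumed form, not the paper's Eq. 11).

    tasks: list of (error_branch_ratio, heterogeneity) tuples, one per task.
    A softmax over the summed metrics favors structurally richer tasks.
    """
    logits = [(ebr + het) / temperature for ebr, het in tasks]
    peak = max(logits)                   # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


# Toy inputs (assumed values).
w = sft_weights([(0.9, 0.5, 0.4), (0.1, 0.5, 0.4)])
p = task_selection_probs([(0.6, 0.8), (0.0, 0.1)])
print(w, p)
```

Under these assumed forms, the trajectory with the higher recovery score receives the larger SFT weight, and the task with more error branches and more varied strategies is sampled more often during RL. A lower `temperature` would sharpen that preference.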

Figure 1 : Overview of the TopoCurate Framework. Our method operates in three systematic stages: (Left) Topological Modeling transforms disjoint rollouts into a unified state-transition graph by defining states via action-observation tuples and aggregating semantically equivalent turns; (Middle) Tra

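Experimental Results

Research Questions

RQ1: Can interaction-topology modeling and the merging of equivalent action-observation states reveal causal structure and robust strategies that outcome-based filtering overlooks?
RQ2: Do the topology-aware SFT metrics (Reflective Recovery, Semantic Efficiency, Distributional Diversity) improve data quality and reduce covariate shift relative to standard outcome-based filtering?
RQ3: Do the topology-guided RL task-selection metrics (Error Branch Ratio, Strategic Heterogeneity) maximize gradient information and accelerate learning in sparse-reward settings?
RQ4: Does the TopoCurate-derived data curation strategy deliver measurable benefits on the in-distribution Tau2 Bench and the out-of-distribution BFCL v3 benchmark?
RQ5: What is the empirical impact of topology-based data curation across model scales (8B, 32B) and domains (Airline, Retail, Telecom)?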

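Key Results

TopoCurate consistently outperforms state-of-the-art baselines on Tau2 Bench (IID) and BFCL v3 (OOD); the abstract reports gains of 4.2% (SFT) and 6.9% (RL).
Topology-aware SFT data selection yields higher Pass@k scores and better generalization than outcome-only filtering.
Topology-driven RL task selection provides higher gradient information (SNR) and improved policy convergence across domains.
Ablation studies identify Reflective Recovery and Structural Complexity as the main contributors to the gains, while Diversity and Efficiency provide domain-dependent benefits.
Analysis of training dynamics shows that TopoCurate-enhanced models exhibit more reflection, higher efficiency, and greater strategic plasticity.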
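For readers unfamiliar with the Pass@k metric cited above, the sketch below computes the widely used unbiased estimator from n trials with c successes. This review does not say which estimator or which values of n and k the paper uses, so this is background illustration only; the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k sampled trials succeeds,
    given n total trials of which c succeeded: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy example (assumed numbers): 4 trials, 2 of them successful.
print(pass_at_k(4, 2, 1))
```

With one sample drawn from four trials, two of which succeeded, the estimate equals the raw success rate. Larger k raises the estimate, because any single success among the k samples counts as a pass.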


This review was generated by AI and checked by a human editor.