QUICK REVIEW

[Paper Review] TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training

Jinluan Yang, Yuxin Liu|arXiv (Cornell University)|Mar 2, 2026

Reinforcement Learning in Robotics | Citations: 0

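One-Line Summary

TopoCurate introduces data curation based on interaction topology to improve the training of tool-use agents. With topology-guided data selection for both SFT and RL, it achieves consistent gains over baselines on Tau2 Bench and BFCLv3.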

ABSTRACT

Training tool-use agents typically relies on outcome-based filtering: Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. However, this paradigm ignores interaction dynamics: successful trajectories may lack error recovery or exhibit redundancy, while pass rates fail to distinguish structurally informative tasks from trivial ones. We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology. By merging equivalent action-observation states, this projection transforms scattered linear trajectories into a structured manifold that explicitly captures how tool invocations and environmental responses drive the divergence between effective strategies and failure modes. Leveraging this representation, we introduce a dual-selection mechanism: for SFT, we prioritize trajectories demonstrating reflective recovery, semantic efficiency, and strategic diversity to mitigate covariate shift and mode collapse; for RL, we select tasks with high error branch ratios and strategic heterogeneity, maximizing gradient Signal-to-Noise Ratio to address vanishing signals in sparse-reward settings. Evaluations on BFCLv3 and Tau2 Bench show that TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines. We will release the code and data soon for further investigations.

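Motivation and Objectives

Shift data curation from outcome-based filtering to topological modeling of agent-environment interactions.
Develop three SFT-oriented topological metrics (Reflective Recovery, Semantic Efficiency, Distributional Diversity) to reduce covariate shift and mode collapse.
Develop two RL-oriented task-selection metrics (Error Branch Ratio, Strategic Heterogeneity) to maximize the gradient signal-to-noise ratio.
Provide a formal framework that projects multi-trial rollouts onto a quotient topology, and demonstrate its effectiveness empirically.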

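Proposed Method

Define a complete interaction turn as $\hat{z}_t = (r_t, a_t, o_t)$, and merge semantically equivalent turns in a quotient topology to obtain a DAG over states.
Introduce three SFT trajectory-scoring metrics (Reflective Recovery, Semantic Efficiency, Distributional Diversity) and compute a composite selection weight w(tau) used to reweight the data (Eq. 7).
Formulate RL task selection in terms of structural metrics, using Error Branch Ratio and Strategic Heterogeneity to define a selection distribution that prioritizes high-SNR tasks (Eq. 11).
Provide a theoretical perspective that links topological reweighting to KL-divergence minimization and reduced covariate shift in SFT, and to Fisher-information maximization in GRPO-based RL.
Evaluate on Tau2 Bench and BFCL v3, showing overall Pass@k improvements and better generalization compared with outcome-based baselines.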
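
The merging step described above can be illustrated with a minimal sketch. It is not the authors' implementation: this review does not reproduce the paper's equivalence test or its definition of Error Branch Ratio. The names `normalize`, `build_graph`, and `failure_only_edge_ratio` are hypothetical, plain string normalization stands in for semantic equivalence, the r_t component of each turn is ignored, and the turn index is folded into node identity purely to keep the merged graph acyclic.

```python
from collections import defaultdict

START = (0, "START")

def normalize(text):
    # Hypothetical stand-in for a semantic-equivalence check:
    # lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def build_graph(rollouts):
    """Merge equivalent (action, observation) turns from several rollouts.

    rollouts: list of (turns, success), where turns is a list of
    (action, observation) string pairs.
    Returns (edges, stats): edges maps node -> set of successor nodes,
    stats maps node -> counts of passing/failing rollouts through it.
    """
    edges = defaultdict(set)
    stats = defaultdict(lambda: {"pass": 0, "fail": 0})
    for turns, success in rollouts:
        outcome = "pass" if success else "fail"
        prev = START
        stats[START][outcome] += 1
        for t, (action, obs) in enumerate(turns, start=1):
            # Assumption: including the step index t prevents cycles.
            node = (t, normalize(action), normalize(obs))
            edges[prev].add(node)
            stats[node][outcome] += 1
            prev = node
    return edges, stats

def failure_only_edge_ratio(edges, stats):
    # Illustrative proxy, NOT the paper's Error Branch Ratio:
    # share of edges that lead to a node no successful rollout visits.
    total = sum(len(children) for children in edges.values())
    if total == 0:
        return 0.0
    bad = sum(
        1
        for children in edges.values()
        for child in children
        if stats[child]["pass"] == 0
    )
    return bad / total

# Toy rollouts for one task (invented for illustration).
rollouts = [
    ([("get_user(id=1)", "ok"), ("book_flight(f=7)", "confirmed")], True),
    ([("Get_User(id=1)", "OK"), ("book_flight(f=9)", "error: sold out")], False),
    ([("get_user(id=1)", "ok"), ("book_flight(f=9)", "error: sold out"),
      ("book_flight(f=7)", "confirmed")], True),
    ([("get_user(id=1)", "ok"), ("cancel_booking(id=3)", "error: not found")], False),
]
edges, stats = build_graph(rollouts)
ratio = failure_only_edge_ratio(edges, stats)
```

In this toy example all four rollouts share the first node, and the branch after it separates a direct success, a recoverable error, and a dead end. That kind of structure is invisible when each trajectory is scored only by its final outcome.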

Figure 1 : Overview of the TopoCurate Framework. Our method operates in three systematic stages: (Left) Topological Modeling transforms disjoint rollouts into a unified state-transition graph by defining states via action-observation tuples and aggregating semantically equivalent turns; (Middle) Tra

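Experimental Results

Research Questions

RQ1: Does modeling the interaction topology and merging equivalent action-observation states reveal causal structure and robust strategies that outcome-based filtering fails to expose?
RQ2: Do the topology-aware SFT metrics (Reflective Recovery, Semantic Efficiency, Distributional Diversity) improve data quality and reduce covariate shift compared with standard outcome-based filtering?
RQ3: Do the topology-guided RL task-selection metrics (Error Branch Ratio, Strategic Heterogeneity) maximize gradient information and accelerate training in sparse-reward settings?
RQ4: Do the data curation strategies derived from TopoCurate yield measurable gains over state-of-the-art baselines on the in-domain Tau2 Bench and the external BFCL v3 benchmark?
RQ5: What is the empirical impact of topological data curation across model scales (8B, 32B) and diverse domains (airline, retail, telecom)?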
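
As background for RQ3, the sketch below shows why tasks with uniform outcomes contribute no learning signal under group-normalized advantages, the usual GRPO formulation. Whether the paper uses exactly this variant is an assumption, and `group_advantages` and `eps` are illustrative names rather than anything taken from the paper.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same task.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Every rollout succeeds: all advantages are zero, so no gradient signal.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed outcomes: successes and failures get opposite-sign advantages.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
```

With binary rewards, a task that always passes or always fails yields zero advantages for every rollout. This is one reading of the "vanishing signals" the abstract mentions, and it suggests why selecting tasks with mixed, structurally varied outcomes could help.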

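Key Findings

TopoCurate consistently outperforms state-of-the-art baselines on Tau2 Bench (IID) and BFCL v3 (OOD).
Topology-aware SFT data selection yields higher Pass@k scores and better generalization than outcome-only filtering.
Topology-driven RL task selection increases gradient information (SNR) and improves policy convergence across domains.
Ablation studies identify Reflective Recovery and Structural Complexity as the main drivers of the performance gains, and confirm that Diversity and Efficiency provide domain-specific benefits.
Analysis of training dynamics shows increased policy reflection, improved efficiency, and greater strategic plasticity in TopoCurate-enhanced models.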
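
Since the findings are reported as Pass@k, a short sketch of how such a score is commonly computed may help. This review does not state which estimator the paper uses, so the code below is an assumption: it is the widely used unbiased estimator based on n sampled trials per task, of which c succeed. The benchmark's own protocol may differ.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of the probability that at least one of k
    trials, drawn without replacement from n recorded trials with
    c successes, is a success."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers: 4 trials per task, 2 of them successful.
p1 = pass_at_k(4, 2, 1)
p2 = pass_at_k(4, 2, 2)
```

Here `p1` equals the raw success rate of 0.5, while `p2` is higher because either of two sampled trials may succeed.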


This review was generated by AI and checked by a human editor.