QUICK REVIEW

[论文解读] ASTER: Agentic Scaling with Tool-integrated Extended Reasoning

Xuqin Zhang, Quan (Sophia) He|arXiv (Cornell University)|Feb 1, 2026

Topic Modeling被引用 0

一句话总结

ASTER 引入一种冷启动策略，通过密集交互的工具使用轨迹防止交互崩溃，并在集成工具的强化学习中实现可扩展的代理式推理，使用4B模型在数学基准上达到最先进水平。

ABSTRACT

Reinforcement learning (RL) has emerged as a dominant paradigm for eliciting long-horizon reasoning in Large Language Models (LLMs). However, scaling Tool-Integrated Reasoning (TIR) via RL remains challenging due to interaction collapse: a pathological state where models fail to sustain multi-turn tool usage, instead degenerating into heavy internal reasoning with only trivial, post-hoc code verification. We systematically study three questions: (i) how cold-start SFT induces an agentic, tool-using behavioral prior, (ii) how the interaction density of cold-start trajectories shapes exploration and downstream RL outcomes, and (iii) how the RL interaction budget affects learning dynamics and generalization under varying inference-time budgets. We then introduce ASTER (Agentic Scaling with Tool-integrated Extended Reasoning), a framework that circumvents this collapse through a targeted cold-start strategy prioritizing interaction-dense trajectories. We find that a small expert cold-start set of just 4K interaction-dense trajectories yields the strongest downstream performance, establishing a robust prior that enables superior exploration during extended RL training. Extensive evaluations demonstrate that ASTER-4B achieves state-of-the-art results on competitive mathematical benchmarks, reaching 90.0% on AIME 2025, surpassing leading frontier open-source models, including DeepSeek-V3.2-Exp.

研究动机与目标

研究冷启动 SFT 设计如何塑造下游工具使用行为及其在 RL 下的表现。
考察冷启动轨迹的交互密度如何影响探索与 RL 结果。
评估在不同推理预算下，RL 交互预算对学习动态和测试时表现的影响。
证明密集、长时程的冷启动先验能够实现更优的代理式扩展和工具整合。

提出的方法

使用 GPT-OSS-20B 合成带工具的轨迹，并精选一个4K条交互密集轨迹的专家冷启动数据集。
采用两阶段冷启动 SFT，随后进行带组相对策略优化（GRPO）的强化学习。
比较多种冷启动策略（Zero、ZeroForceTool、ReTool、DemyAgent、ASTER）以研究行为先验。
改变交互密度和 RL 预算以分析对探索、工具使用及最终表现的影响。
在指定的解码设置下，在竞赛性数学基准（AIME2024、AIME2025、HMMT2025、BeyondAIME）上进行评估。
报告训练动态，包括工具调用频率和熵，以理解代理式扩展行为。

Figure 1 : ASTER demonstrates remarkable efficiency, surpassing much larger and stronger models on the challenging AIME 2025 benchmark. It achieves a score of 90.0, outperforming DeepSeek-V3.2-exp (89.3/671B).

实验结果

研究问题

RQ1RQ1：冷启动 SFT 设计如何形塑诱导的工具使用行为先验及其下游 RL 表现？
RQ2RQ2：冷启动轨迹的交互密度如何影响探索与 RL 结果？
RQ3RQ3：在不同推理预算下，RL 交互预算如何影响学习动态和测试时表现？

主要发现

模型	AIME2024	AIME2025	HMMT2025	BeyondAIME	avg@16
OpenReasoning-Nemotron-7B	84.7	78.2	63.5	–	–
Qwen3-235B-A22B-Thinking	85.7	81.5	62.5	–	–
POLARIS-4B-Preview	81.2	79.4	58.7	–	–
ReTool-32B	72.5	54.3	–	–	–
rStar2-Agent-14B	80.6	69.8	52.7	–	–
DemyAgent-4B	72.6	70.0	52.9	†	35.3
ASTER-1.7B-SFT	19.4	19.0	11.3	6.4	–
ASTER-1.7B	64.6	59.6	47.5	26.3	–
ASTER-4B-SFT	62.5	54.6	43.3	27.4	–
ASTER-4B	82.3	85.0	73.3	53.9	–
ASTER-4B w/ 90K Inference Budget	85.8	90.0	77.1	61.7	–

一个小规模、交互密集的冷启动集（4K 条轨迹，>9 次工具交互）带来最强的下游表现。
交互密度是冷启动先验的关键属性，在 RL 期间维持探索，防止交互崩溃。
更高的训练时交互预算在推理预算较大时提升测试时的扩展性，而更严格的推理预算更有利于在受限交互预算下训练的模型。
ASTER-4B 在数学基准上实现了最先进的结果，尤其是在 AIME2025 上达到 85.0%（推理预算 90K 时达到 90.0%），超越了更大模型。
在 90K 推理预算下，ASTER-4B 在 AIME2025 达到 90.0%，在 HMMT2025 达到 77.1%，在 BeyondAIME 达到 61.7%，超越了若干更大的基线模型。
训练动态显示冷启动后初期性能下降，随后在 RL 进展中实现恢复并获得更优的长时程工具使用。

(a) Tool call count distribution across different cold-start datasets.

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。