QUICK REVIEW

[Paper Review] Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Ailin Huang, Ang Li|arXiv (Cornell University)|Feb 11, 2026

Multimodal Machine Learning Applications|Cited by 0

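One-Sentence Summary

Step 3.5 Flash is a sparse MoE model with 196B total parameters and 11B active parameters. Using hybrid attention, MTP, and MIS-PO reinforcement learning, it reaches frontier-level reasoning and agentic capability at low latency.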

ABSTRACT

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.

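Motivation and Goals

Bridge frontier-level agentic intelligence and computational efficiency in an open model.
Deliver strong reasoning together with fast, reliable execution in multi-round agentic interactions.
Develop a scalable post-training RL framework that stays stable over long training runs.
Show competitive performance on math, coding, and tool-use benchmarks with 11B active parameters.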

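Proposed Method

Uses a sparse MoE backbone with 196B total parameters, of which 11B are active per token.
Adopts a hybrid 3:1 sliding-window/full attention layout (S3F1) with head-wise gated attention to improve long-context efficiency.
Adds Multi-Token Prediction (MTP-3) heads to enable speculative decoding and reduce autoregressive latency.
Combines MoE routing with EP-group balancing to mitigate load imbalance and expert collapse.
Applies MIS-PO (Metropolis Independence Sampling-Filtered Policy Optimization) for scalable, stable RL on long-horizon agentic tasks.
Provides a post-training recipe that alternates domain specialization with global consolidation so that a single generalist model is retained.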
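
The way MTP heads can support speculative decoding is easiest to see in a toy draft-and-verify step. The following is a minimal sketch under stated assumptions, not the paper's implementation: `next_token` and `draft_tokens` are hypothetical stand-ins for the main model and the MTP heads, verification is greedy, and the main model is called once per draft token for clarity, whereas a real system would typically verify all drafts in one batched forward pass.

```python
from typing import Callable, List


def speculative_step(
    prefix: List[int],
    next_token: Callable[[List[int]], int],          # hypothetical: main model, greedy next token
    draft_tokens: Callable[[List[int]], List[int]],  # hypothetical: MTP heads, k draft tokens
) -> List[int]:
    """Return the tokens produced by one greedy draft-and-verify step."""
    accepted: List[int] = []
    for draft in draft_tokens(prefix):
        target = next_token(prefix + accepted)
        if draft != target:
            # First mismatch: keep the main model's token and discard later drafts.
            accepted.append(target)
            return accepted
        accepted.append(draft)
    # Every draft matched, so the main model contributes one extra token.
    accepted.append(next_token(prefix + accepted))
    return accepted


# Toy stand-ins (assumptions for illustration): the "main model" counts upward by one.
def toy_next_token(seq: List[int]) -> int:
    return seq[-1] + 1


def toy_draft_all_correct(seq: List[int]) -> List[int]:
    return [seq[-1] + 1, seq[-1] + 2, seq[-1] + 3]


def toy_draft_last_wrong(seq: List[int]) -> List[int]:
    return [seq[-1] + 1, seq[-1] + 2, seq[-1] + 5]
```

With three drafts per step, a fully accepted step yields four tokens for one verification round, while a rejected draft still yields at least one token, so the output always matches what greedy decoding with the main model alone would produce.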

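Experimental Results

Research Questions

RQ1: How does a configuration with 11B active parameters reach parity with frontier models on reasoning and agentic tasks?
RQ2: Which architectural choices (attention layout, gating, MTP) give the best latency/performance trade-off for long-context agentic workloads?
RQ3: Can a unified post-training RL framework (MIS-PO) scale to long-horizon agentic reasoning while remaining stable?
RQ4: What are the stability challenges of large-scale sparse MoE training, which mitigations apply, and how can they be monitored?
RQ5: How does Step 3.5 Flash compare with leading frontier systems on math, coding, and tool-use benchmarks?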

Key Findings

Layout	SWA Heads	Rel. FLOPs	Pre-train Avg.	Decode/Prefill	Reasoning	Math	Code	Sci	General	LongCtx
FFFF	32	~2.68 / 2.90	54.1	40.8	40.9	19.6	42.7	26.5	28.8	33.2
S1F1	32	~1.58 / 1.65	54.6	42.1	42.3	19.3	44.5	26.8	29.6	34.1
S3F1	32	~1.00 / 1.00	53.6	40.2	40.4	18.9	42.4	25.4	27.5	32.5
S3F1+Head	48	~1.01 / 1.02	55.7	40.6	40.3	18.3	44.0	26.0	28.2	32.9
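
To make the layout names in the table concrete, the sketch below expands a pattern such as S3F1 into a per-layer list and counts how many key positions a single decoded token attends to. It is a rough illustration only: the window size, layer count, context length, and the ordering of layers within each group are assumed, and the count ignores MoE, projection, and head-count costs, so it is not expected to reproduce the Rel. FLOPs column.

```python
from typing import Dict, List

# Assumed expansion of the layout names: S = sliding-window layer, F = full-attention layer.
LAYOUT_PATTERNS: Dict[str, str] = {"FFFF": "F", "S1F1": "SF", "S3F1": "SSSF"}


def expand_layout(pattern: str, num_layers: int) -> List[str]:
    """Repeat a pattern such as 'SSSF' across num_layers layers."""
    if not pattern or set(pattern) - {"S", "F"}:
        raise ValueError("pattern must be a non-empty string of 'S' and 'F'")
    return [pattern[i % len(pattern)] for i in range(num_layers)]


def attended_keys_per_token(layers: List[str], context_len: int, window: int) -> int:
    """Sum over layers of the key positions one decoded token attends to."""
    total = 0
    for kind in layers:
        # A full layer sees the whole context; a sliding-window layer sees at most `window` keys.
        total += context_len if kind == "F" else min(context_len, window)
    return total


def compare_layouts(num_layers: int, context_len: int, window: int) -> Dict[str, int]:
    """Attended-key counts for each named layout under the same settings."""
    return {
        name: attended_keys_per_token(expand_layout(p, num_layers), context_len, window)
        for name, p in LAYOUT_PATTERNS.items()
    }
```

For example, `compare_layouts(num_layers=8, context_len=1000, window=100)` gives 8000 attended keys for FFFF, 4400 for S1F1, and 2600 for S3F1, which shows why the gap between layouts widens as the context grows relative to the window.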

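With 11B active parameters, Step 3.5 Flash is competitive on reasoning and tool-augmented benchmarks.
It scores 85.4% on IMO-AnswerBench and 86.4% on LiveCodeBench-v6.
It reaches 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0.
On several tasks the model performs at a frontier level comparable to GPT-5.2 xHigh and Gemini 3.0 Pro.
MTP, combined with SWA and head-wise gated attention, lowers latency while maintaining or improving quality.
MIS-PO makes RL for long-horizon reasoning scalable while reducing gradient variance and improving stability.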
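
This review names MIS-PO but does not spell out its update rule. As a loose illustration of the general idea of filtering off-policy samples by importance ratio instead of reweighting them, the sketch below keeps only samples whose ratio falls inside an assumed trust interval and averages a REINFORCE-style surrogate over the survivors. The interval bounds, function names, and the surrogate are all assumptions for illustration and should not be read as the paper's algorithm.

```python
import math
from typing import List, Sequence


def ratio_filter_mask(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    low: float = 0.5,   # assumed lower bound on the importance ratio
    high: float = 2.0,  # assumed upper bound on the importance ratio
) -> List[bool]:
    """Mark samples whose ratio pi_new / pi_old lies inside [low, high]."""
    if len(logp_new) != len(logp_old):
        raise ValueError("log-probability sequences must have the same length")
    return [low <= math.exp(ln - lo) <= high for ln, lo in zip(logp_new, logp_old)]


def filtered_surrogate(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    advantages: Sequence[float],
    low: float = 0.5,
    high: float = 2.0,
) -> float:
    """Mean of -advantage * logp_new over accepted samples (0.0 if none are accepted)."""
    if len(advantages) != len(logp_new):
        raise ValueError("advantages must match the log-probability sequences")
    mask = ratio_filter_mask(logp_new, logp_old, low, high)
    kept = [a * ln for a, ln, keep in zip(advantages, logp_new, mask) if keep]
    if not kept:
        return 0.0
    return -sum(kept) / len(kept)
```

Because rejected samples contribute nothing rather than a large weight, a few extreme ratios cannot dominate the update, which is one plausible route to the lower gradient variance reported above. In an autodiff framework the same mask would be applied to the loss tensor before averaging.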

This review was generated by AI and checked by a human editor.