QUICK REVIEW

[Paper Review] Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Ailin Huang, Ang Li|arXiv (Cornell University)|Feb 11, 2026

Multimodal Machine Learning Applications | Citations: 0

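One-Sentence Summary

Step 3.5 Flash is a Mixture-of-Experts (MoE) model with 196B total parameters, of which 11B are active per token. It uses hybrid attention, MTP, and MIS-PO reinforcement learning to deliver frontier-level reasoning and agentic capability at low latency.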

ABSTRACT

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.

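Motivation and Objectives

Bridge frontier-level agentic intelligence and computational efficiency in an open-source model.
Achieve strong reasoning and fast, reliable execution in multi-round agentic interactions.
Develop a scalable post-training RL framework that remains stable over long training runs.
Demonstrate competitive performance on mathematics, coding, and tool-use benchmarks with 11B active parameters.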

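Proposed Method

Uses a sparse MoE backbone with 196B total parameters, of which 11B are active per token.
Adopts a hybrid 3:1 Sliding-Window/Full Attention layout (S3F1) with head-wise gating to improve long-context efficiency.
Introduces Multi-Token Prediction (MTP-3) heads to enable speculative decoding and reduce autoregressive latency.
Combines MoE routing with EP-group balancing to mitigate load imbalance and expert collapse.
Adopts MIS-PO (Metropolis Independence Sampling-Filtered Policy Optimization) for stability and scalability on long-horizon agentic tasks.
Provides a post-training recipe that alternates domain-specific specialization with global consolidation to maintain a single generalist model.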
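The 3:1 interleaving of sliding-window and full attention can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `s3f1_layout`, `causal_mask`, and `attended_pairs` are hypothetical, the placement of the full-attention layer at the end of each group of four is assumed, and the 8-layer depth, sequence length of 64, and window of 4 are toy values that do not come from the paper.

```python
def s3f1_layout(num_layers, swa_per_full=3):
    """Return a per-layer attention type list: 'S' (sliding-window) or 'F' (full).

    Assumption: every (swa_per_full + 1)-th layer is full attention and the
    layers before it in the group use sliding-window attention.
    """
    period = swa_per_full + 1
    return ["F" if (i + 1) % period == 0 else "S" for i in range(num_layers)]


def causal_mask(seq_len, window=None):
    """Boolean mask[q][k]: True if query position q may attend to key position k.

    window=None gives full causal attention; otherwise each query sees only
    the most recent `window` positions (itself included).
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


def attended_pairs(layout, seq_len, window):
    """Count query-key pairs touched across all layers (a rough cost proxy)."""
    total = 0
    for kind in layout:
        mask = causal_mask(seq_len, window if kind == "S" else None)
        total += sum(sum(row) for row in mask)
    return total


layout = s3f1_layout(8)
print("".join(layout))  # SSSFSSSF

# Toy comparison of an all-full stack against the hybrid stack.
full_cost = attended_pairs(["F"] * 8, seq_len=64, window=4)
hybrid_cost = attended_pairs(layout, seq_len=64, window=4)
print(f"hybrid / full attended pairs: {hybrid_cost / full_cost:.2f}")
```

The pair count is only a proxy for attention cost. It ignores head-wise gating, the MoE feed-forward layers, and KV-cache effects, so the ratio it prints should not be read as a reproduction of the paper's FLOPs figures.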

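Experimental Results

Research Questions

RQ1: Can the 11B-active-parameter configuration match frontier models on reasoning and agentic tasks?
RQ2: Which architectural choices (attention layout, gating, MTP) yield the best latency-performance trade-off for long-context agentic workloads?
RQ3: Can a unified post-training RL framework (MIS-PO) scale to long-horizon agentic reasoning while remaining stable?
RQ4: What are the stability challenges of large-scale sparse MoE training, how can they be mitigated, and how should they be monitored?
RQ5: How does Step 3.5 Flash compare with leading frontier systems on mathematics, coding, and tool-use benchmarks?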

Key Findings

Layout	SWA Heads	Rel. FLOPs	Pre-train Avg.	Decode/Prefill	Reasoning	Math	Code	Sci	General	LongCtx
FFFF	32	~2.68 / 2.90	54.1	40.8	40.9	19.6	42.7	26.5	28.8	33.2
S1F1	32	~1.58 / 1.65	54.6	42.1	42.3	19.3	44.5	26.8	29.6	34.1
S3F1	32	~1.00 / 1.00	53.6	40.2	40.4	18.9	42.4	25.4	27.5	32.5
S3F1+Head	48	~1.01 / 1.02	55.7	40.6	40.3	18.3	44.0	26.0	28.2	32.9
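The trade-off in the table can be made concrete with a small calculation. The sketch below copies the two "Rel. FLOPs" values and the "Pre-train Avg." value for each layout from the rows above and reports how much compute a layout saves relative to another. The table does not say which of the two FLOPs numbers is which, so they are simply called `first` and `second` here; the names `rows`, `saving_vs`, and `avg_delta` are hypothetical.

```python
# (rel_flops_first, rel_flops_second, pretrain_avg), copied from the table.
rows = {
    "FFFF": (2.68, 2.90, 54.1),
    "S1F1": (1.58, 1.65, 54.6),
    "S3F1": (1.00, 1.00, 53.6),
    "S3F1+Head": (1.01, 1.02, 55.7),
}


def saving_vs(baseline, candidate):
    """Fraction of the baseline's relative FLOPs saved, per FLOPs column."""
    b, c = rows[baseline], rows[candidate]
    return tuple(1.0 - c[i] / b[i] for i in range(2))


def avg_delta(baseline, candidate):
    """Change in the Pre-train Avg. column, candidate minus baseline."""
    return round(rows[candidate][2] - rows[baseline][2], 1)


first, second = saving_vs("FFFF", "S3F1+Head")
print(f"FLOPs saved vs FFFF: {first:.1%} / {second:.1%}")
print(f"Pre-train Avg. change vs FFFF: {avg_delta('FFFF', 'S3F1+Head'):+.1f}")
```

Read this way, S3F1 with head-wise gating uses roughly a third of the all-full-attention FLOPs while scoring higher on the Pre-train Avg. column, and adding head-wise gating to plain S3F1 costs only 1 to 2 percent more FLOPs.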

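Step 3.5 Flash achieves competitive performance on reasoning and tool-use benchmarks with 11B active parameters.
It scores 85.4% on IMO-AnswerBench and 86.4% on LiveCodeBench-v6.
It reaches 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0.
Across multiple tasks, it delivers frontier-level performance close to GPT-5.2 xHigh and Gemini 3.0 Pro.
MTP combined with SWA and head-wise gating reduces latency while maintaining or improving quality.
MIS-PO enables scalable RL for long-horizon reasoning, reducing gradient variance and improving stability.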
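A greedy draft-and-verify loop shows why multi-token prediction can cut autoregressive latency without changing the output. This is a generic speculative-decoding sketch, not the acceptance rule used in the paper, which this review does not describe. `draft_fn` and `target_fn` are hypothetical stand-ins for the MTP heads and the main model, and the toy functions at the bottom are invented for the demonstration.

```python
from typing import Callable, List

Token = int


def speculative_step(
    prefix: List[Token],
    draft_fn: Callable[[List[Token], int], List[Token]],
    target_fn: Callable[[List[Token]], Token],
    num_draft: int = 3,
) -> List[Token]:
    """Run one greedy draft-and-verify step and return the newly accepted tokens.

    draft_fn proposes `num_draft` tokens; target_fn gives the main model's
    greedy next token for a context. Drafted tokens are kept while they match
    the target; at the first mismatch the target's token is used instead.
    A real system would score all drafted positions in one batched forward
    pass; the sequential calls here are for clarity only.
    """
    drafted = draft_fn(prefix, num_draft)
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_fn(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correction token from the main model
            return accepted
        accepted.append(tok)
    # All drafts matched: the verification pass also yields one bonus token.
    accepted.append(target_fn(prefix + accepted))
    return accepted


# Toy models (assumptions): the target counts upward modulo 10.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def perfect_draft(ctx: List[Token], n: int) -> List[Token]:
    return [(ctx[-1] + i + 1) % 10 for i in range(n)]


def flawed_draft(ctx: List[Token], n: int) -> List[Token]:
    out = perfect_draft(ctx, n)
    out[-1] = (out[-1] + 5) % 10  # corrupt the last drafted token
    return out


print(speculative_step([1], perfect_draft, toy_target))  # [2, 3, 4, 5]
print(speculative_step([1], flawed_draft, toy_target))   # [2, 3, 4]
```

In both cases the accepted tokens are exactly what plain greedy decoding with the target would produce. The gain is that one verification step can emit up to `num_draft + 1` tokens instead of one, which is where the latency reduction comes from when drafts are usually correct.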


This review was generated by AI and checked by a human editor.