QUICK REVIEW

[Paper Review] StreamVLA: Breaking the Reason-Act Cycle via Completion-State Gating

Chen, Tongqing, Wu, Hang|arXiv (Cornell University)|Feb 1, 2026

Multimodal Machine Learning Applications|Citations: 0

One-line summary

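StreamVLA introduces a gated dual-system architecture that separates slow planning from fast action within a single backbone. It gates reasoning on an imagined completion state, which reduces latency while improving long-horizon manipulation performance.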

ABSTRACT

Long-horizon robotic manipulation requires bridging the gap between high-level planning (System 2) and low-level control (System 1). Current Vision-Language-Action (VLA) models often entangle these processes, performing redundant multimodal reasoning at every timestep, which leads to high latency and goal instability. To address this, we present StreamVLA, a dual-system architecture that unifies textual task decomposition, visual goal imagination, and continuous action generation within a single parameter-efficient backbone. We introduce a "Lock-and-Gated" mechanism to intelligently modulate computation: only when a sub-task transition is detected, the model triggers slow thinking to generate a textual instruction and imagines the specific visual completion state, rather than generic future frames. Crucially, this completion state serves as a time-invariant goal anchor, making the policy robust to execution speed variations. During steady execution, these high-level intents are locked to condition a Flow Matching action head, allowing the model to bypass expensive autoregressive decoding for 72% of timesteps. This hierarchical abstraction ensures sub-goal focus while significantly reducing inference latency. Extensive evaluations demonstrate that StreamVLA achieves state-of-the-art performance, with a 98.5% success rate on the LIBERO benchmark and robust recovery in real-world interference scenarios, achieving a 48% reduction in latency compared to full-reasoning baselines.

Motivation and objectives

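Bridge the gap between high-level planning (System 2) and low-level control (System 1) in long-horizon robotic manipulation.
Reduce per-timestep multimodal reasoning by gating the expensive reasoning so that it runs only at sub-task transitions.
Introduce a completion-state imagination head that grounds the goal, together with a lightweight gating module that modulates computation.
Demonstrate state-of-the-art performance with lower latency on the LIBERO and RoboTwin 2.0 benchmarks.
Demonstrate robustness to real-world disturbances through a dynamic, forward-looking control loop.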

Proposed method

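A unified Vision-Language-Action backbone that shares parameters across perception, planning, and control.
A Lock-and-Gated mechanism that uses the completion-state image as a goal anchor and decides when to trigger System 2 planning.
An imagination head, based on the Infinity bitwise autoregressive model, that generates sub-task completion states.
A gating module that computes a mismatch score between the current observation and the locked completion goal and uses it to switch between Skip Mode and Full Mode.
A Flow Matching action head conditioned on the locked high-level intent (text plan and visual goal).
Two-stage training: Stage I freezes the backbone and aligns the imagination and sub-task heads; Stage II fine-tunes the model end to end.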
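
The Skip Mode / Full Mode switching described above can be illustrated with a minimal sketch. Everything in it is hypothetical and not taken from the paper: the cosine-based mismatch score, the `done_threshold` value, the `LockedIntent` and `GatedController` names, and the `plan_fn` / `act_fn` callables that stand in for the slow planning path and the Flow Matching action head. It also assumes that a low mismatch means the sub-task is complete and therefore triggers Full Mode; the actual trigger rule in StreamVLA may differ.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def cosine_mismatch(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity (assumed stand-in for the mismatch score)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0  # treat a degenerate embedding as a full mismatch
    return 1.0 - dot / norm


@dataclass
class LockedIntent:
    """High-level intent that stays fixed during steady execution."""
    instruction: str  # textual sub-task plan
    goal: Vector      # embedding of the imagined completion state


@dataclass
class GatedController:
    plan_fn: Callable[[Vector], LockedIntent]        # slow path (hypothetical)
    act_fn: Callable[[Vector, LockedIntent], Vector]  # fast path (hypothetical)
    done_threshold: float = 0.05                      # assumed value
    locked: Optional[LockedIntent] = None
    modes: List[str] = field(default_factory=list)

    def step(self, obs: Vector) -> Vector:
        # Full Mode: nothing is locked yet, or the observation already
        # matches the locked completion state (assumed to mean "sub-task done").
        if self.locked is None or (
            cosine_mismatch(obs, self.locked.goal) <= self.done_threshold
        ):
            self.locked = self.plan_fn(obs)
            self.modes.append("full")
        else:
            # Skip Mode: reuse the locked intent, no replanning.
            self.modes.append("skip")
        return self.act_fn(obs, self.locked)
```

In this sketch the expensive `plan_fn` call happens only when the gate fires, while `act_fn` runs at every step; `modes` records which path each step took so the skip ratio can be inspected afterwards.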

Experimental results

Research questions

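RQ1: Can a unified VLA backbone support both fast control and slow planning without running autoregressive decoding at every step?
RQ2: Does predicting the sub-task completion state provide a more stable, speed-invariant visual anchor than predicting a future frame at a fixed time offset?
RQ3: How much can a lightweight gating mechanism reduce latency while preserving long-horizon task success?
RQ4: How does including both the text plan and the visual imagination affect performance and robustness?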

Key findings

Method	Scale	Params (B)	Spatial	Object	Goal	Long	Average
FlowVLA	Large	8.5	93.2	95.0	91.6	72.6	88.1
UnifiedVLA		8.5	95.4	98.8	93.6	94.0	95.5
OpenVLA		7	84.7	88.4	79.2	53.7	76.5
OpenVLA-OFT		7	97.6	98.4	97.9	94.5	97.1
UniVLA		7	96.5	96.8	95.6	92.0	95.2
CoT-VLA		7	87.5	91.6	87.6	69.0	81.1
WorldVLA		7	87.6	96.2	83.4	60.0	81.8
ThinkAct		7	88.3	91.4	87.1	70.9	84.4
MemoryVLA		7	98.4	98.4	96.4	95.6	96.5
4D-VLA		4	88.9	95.2	90.9	79.1	88.6
SpatialVLA		4	88.2	89.9	78.6	55.5	78.1
π0		3	96.8	98.8	95.8	85.2	94.2
π0-FAST		3	96.4	96.8	88.6	60.2	85.5
StreamVLA	Medium	3	99.2	99.4	98.6	96.6	98.5

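Achieves a 98.5% average success rate on LIBERO with fewer parameters than the 7B+ baselines (3B for StreamVLA).
Maintains a 96.6% success rate on LIBERO-Long, indicating robustness in long-horizon planning.
Reduces average latency by 48% relative to the full-reasoning baseline (from 244 ms to 128 ms).
Reaches a 37.2% average success rate on the RoboTwin 2.0 hard setting, outperforming strong baselines under domain randomization.
Ablations show that gating gives a Pareto-optimal speed/accuracy trade-off and that both the text plan and the visual imagination contribute.
Fixed-step prediction (t+Δt) performs worse than completion-state prediction.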
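
As a quick sanity check on the latency figures, the sketch below recomputes the relative reduction from the two reported means. It then goes one step beyond what the paper reports: assuming, purely for illustration, that a Full Mode step costs the same 244 ms as the full-reasoning baseline and that the abstract's 72% of timesteps run in Skip Mode, it solves a simple two-mode mixture for the implied Skip Mode latency. The mixture model and the function names are assumptions, not results from the paper.

```python
def relative_reduction(baseline_ms: float, new_ms: float) -> float:
    """Fraction by which new_ms is lower than baseline_ms."""
    return (baseline_ms - new_ms) / baseline_ms


def implied_skip_latency(full_ms: float, mean_ms: float, skip_fraction: float) -> float:
    """Solve mean = skip_fraction * skip + (1 - skip_fraction) * full for skip.

    Assumes (hypothetically) that every step is either a Skip Mode step or a
    Full Mode step and that Full Mode costs full_ms.
    """
    return (mean_ms - (1.0 - skip_fraction) * full_ms) / skip_fraction


reduction = relative_reduction(244.0, 128.0)
skip_ms = implied_skip_latency(full_ms=244.0, mean_ms=128.0, skip_fraction=0.72)

print(f"relative reduction: {reduction:.1%}")        # 47.5%
print(f"implied Skip Mode latency: {skip_ms:.1f} ms")  # 82.9 ms
```

The recomputed 47.5% is consistent with the reported 48% after rounding. The roughly 83 ms Skip Mode figure is only as good as the mixture assumption behind it.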

This review was generated by AI and checked by a human editor.