QUICK REVIEW

[Paper Review] StreamVLA: Breaking the Reason-Act Cycle via Completion-State Gating

Chen, Tongqing, Wu, Hang|arXiv (Cornell University)|Feb 1, 2026
Multimodal Machine Learning Applications | Cited by 0
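ONE-SENTENCE SUMMARY

StreamVLA introduces a dual-system, gated architecture that separates slow planning from fast action within a single backbone, using completion-state imagination to gate reasoning and reduce latency while achieving strong performance on long-horizon manipulation tasks.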

ABSTRACT

Long-horizon robotic manipulation requires bridging the gap between high-level planning (System 2) and low-level control (System 1). Current Vision-Language-Action (VLA) models often entangle these processes, performing redundant multimodal reasoning at every timestep, which leads to high latency and goal instability. To address this, we present StreamVLA, a dual-system architecture that unifies textual task decomposition, visual goal imagination, and continuous action generation within a single parameter-efficient backbone. We introduce a "Lock-and-Gated" mechanism to intelligently modulate computation: only when a sub-task transition is detected, the model triggers slow thinking to generate a textual instruction and imagines the specific visual completion state, rather than generic future frames. Crucially, this completion state serves as a time-invariant goal anchor, making the policy robust to execution speed variations. During steady execution, these high-level intents are locked to condition a Flow Matching action head, allowing the model to bypass expensive autoregressive decoding for 72% of timesteps. This hierarchical abstraction ensures sub-goal focus while significantly reducing inference latency. Extensive evaluations demonstrate that StreamVLA achieves state-of-the-art performance, with a 98.5% success rate on the LIBERO benchmark and robust recovery in real-world interference scenarios, achieving a 48% reduction in latency compared to full-reasoning baselines.
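
The Flow Matching action head mentioned in the abstract produces a continuous action by integrating a velocity field from noise to an action, rather than decoding tokens one at a time. The sketch below illustrates that sampling step only. The abstract does not specify the probability path, the solver, or the step count, so the straight-line path, the Euler solver, `num_steps`, and the names `sample_action`, `toy_velocity`, and `locked_intent` are all assumptions made for illustration; `toy_velocity` stands in for a learned network.

```python
import random

def sample_action(velocity_fn, cond, dim, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (action) with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so toy_velocity never divides by zero
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(x, t, cond):
    """Hypothetical stand-in for the learned velocity network.

    For a straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity that
    carries x_t to the target x_1 is (x_1 - x_t) / (1 - t).
    """
    return [(target - xi) / (1.0 - t) for xi, target in zip(x, cond["target"])]

# Hypothetical locked intent: a text plan plus a target the toy field steers toward.
locked_intent = {"text": "pick up the mug", "target": [0.2, -0.1, 0.4]}
action = sample_action(toy_velocity, locked_intent, dim=3)
```

In a real system the conditioning would be the locked textual plan and visual goal rather than an explicit target vector; the point here is only that each action costs a fixed, small number of network evaluations.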

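MOTIVATION AND OBJECTIVES

  • Bridge the gap between high-level planning (System 2) and low-level control (System 1) in long-horizon robotic manipulation.
  • Eliminate redundant per-timestep multimodal reasoning by performing heavy reasoning only at sub-task transitions.
  • Introduce a completion-state imagination head for goal alignment and a lightweight gating module to modulate computation.
  • Demonstrate state-of-the-art performance and reduced latency on the LIBERO and RoboTwin 2.0 benchmarks.
  • Demonstrate robustness to real-world interference through a dynamic, foresight-driven control loop.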

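PROPOSED METHOD

  • A unified Vision-Language-Action (VLA) backbone that shares parameters across perception, planning, and control.
  • A Lock-and-Gated mechanism that uses the completion-state image as a goal anchor and decides when to trigger System 2 planning.
  • An imagination head, based on the Infinity bitwise autoregressive model, that generates sub-task completion states.
  • A gating module that computes a discrepancy score between the current observation and the locked completion goal to switch between Skip mode and Full mode.
  • An action head that uses Flow Matching, conditioned on the locked high-level intents (textual plan and visual goal).
  • A two-stage curriculum: Stage I aligns the imagination and sub-task heads with the backbone frozen; Stage II fine-tunes end to end.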
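
The gating step in the list above can be pictured as a small control loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the summary does not give the discrepancy metric, the threshold, or the switching direction, so the cosine distance, `threshold=0.05`, and the rule "low discrepancy means the sub-task is complete, so re-plan" are assumptions, and `plan_fn` and `act_fn` are hypothetical stand-ins for the System 2 and System 1 heads.

```python
import math

def discrepancy(obs, goal):
    """Cosine distance between two feature vectors (an assumed metric)."""
    dot = sum(o * g for o, g in zip(obs, goal))
    norm = math.sqrt(sum(o * o for o in obs)) * math.sqrt(sum(g * g for g in goal))
    if norm == 0.0:
        return 1.0  # treat a degenerate vector as maximally different
    return 1.0 - dot / norm

def run_episode(observations, plan_fn, act_fn, threshold=0.05):
    """Run Full mode at the start and whenever the locked goal is reached; Skip otherwise."""
    locked = None  # locked high-level intent: {"text": ..., "goal": ...}
    modes, actions = [], []
    for obs in observations:
        if locked is None or discrepancy(obs, locked["goal"]) < threshold:
            locked = plan_fn(obs)  # Full mode: slow planning, then lock the new intent
            modes.append("full")
        else:
            modes.append("skip")   # Skip mode: reuse the locked intent
        actions.append(act_fn(obs, locked))
    return modes, actions

# Toy episode with two sub-goals expressed as 2-D feature vectors (hypothetical data).
goals = iter([[1.0, 0.0], [0.0, 1.0]])
plan = lambda obs: {"text": "sub-task", "goal": next(goals)}
act = lambda obs, intent: [g - o for o, g in zip(obs, intent["goal"])]
observations = [[1.0, 1.0], [1.0, 0.5], [1.0, 0.0], [1.0, 0.5], [0.5, 1.0]]
modes, actions = run_episode(observations, plan, act)
```

In this toy run the planner fires on the first step and again when the observation matches the first locked goal; every other step reuses the locked intent, which is the behavior that lets the model bypass autoregressive decoding for most timesteps.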

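EXPERIMENTAL RESULTS

Research Questions

  • RQ1: Can a unified VLA backbone support both fast control and slow planning without step-by-step autoregressive decoding?
  • RQ2: Does predicting sub-task completion states provide a more stable, speed-invariant visual anchor than fixed-time future frames?
  • RQ3: Can a lightweight gating mechanism reduce latency while preserving success rates on long-horizon tasks?
  • RQ4: How does including both textual planning and visual imagination affect performance and robustness?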

Key Findings

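| Method      | Scale  | Params (B) | Spatial | Object | Goal | Long | Average |
|-------------|--------|------------|---------|--------|------|------|---------|
| FlowVLA     | Large  | 8.5        | 93.2    | 95.0   | 91.6 | 72.6 | 88.1    |
| UnifiedVLA  |        | 8.5        | 95.4    | 98.8   | 93.6 | 94.0 | 95.5    |
| OpenVLA     |        | 7          | 84.7    | 88.4   | 79.2 | 53.7 | 76.5    |
| OpenVLA-OFT |        | 7          | 97.6    | 98.4   | 97.9 | 94.5 | 97.1    |
| UniVLA      |        | 7          | 96.5    | 96.8   | 95.6 | 92.0 | 95.2    |
| CoT-VLA     |        | 7          | 87.5    | 91.6   | 87.6 | 69.0 | 81.1    |
| WorldVLA    |        | 7          | 87.6    | 96.2   | 83.4 | 60.0 | 81.8    |
| ThinkAct    |        | 7          | 88.3    | 91.4   | 87.1 | 70.9 | 84.4    |
| MemoryVLA   |        | 7          | 98.4    | 98.4   | 96.4 | 95.6 | 96.5    |
| 4D-VLA      |        | 4          | 88.9    | 95.2   | 90.9 | 79.1 | 88.6    |
| SpatialVLA  |        | 4          | 88.2    | 89.9   | 78.6 | 55.5 | 78.1    |
| π0          |        | 3          | 96.8    | 98.8   | 95.8 | 85.2 | 94.2    |
| π0-FAST     |        | 3          | 96.4    | 96.8   | 88.6 | 60.2 | 85.5    |
| StreamVLA   | Medium | 3          | 99.2    | 99.4   | 98.6 | 96.6 | 98.5    |
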
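  • Achieves a 98.5% average success rate on LIBERO, outperforming prior methods with fewer parameters (StreamVLA: 3B vs. 7B+).
  • Maintains a 96.6% success rate on LIBERO-Long, indicating robustness in long-horizon planning.
  • Reduces average latency by 48% relative to the full-reasoning baseline (from 244 ms to 128 ms).
  • Achieves a 37.2% average success rate under the RoboTwin 2.0 Hard setting, outperforming strong baselines under domain randomization.
  • Ablations show that gating achieves a Pareto-optimal speed/accuracy trade-off, that textual planning and visual imagination both contribute, and that fixed-step prediction (t+Δt) underperforms completion-state prediction.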
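
The latency figures above can be checked with a few lines of arithmetic. The first function reproduces the reported reduction from the two reported latencies. The second is a back-of-envelope estimate only: it assumes a Full-mode step costs the same as the full-reasoning baseline (244 ms) and that the 72% skip rate from the abstract applies to the same measurement, neither of which the summary confirms. The function names are illustrative.

```python
def latency_reduction(baseline_ms, gated_ms):
    """Fractional latency reduction relative to the baseline."""
    return (baseline_ms - gated_ms) / baseline_ms

def implied_skip_latency(full_ms, mean_ms, skip_fraction):
    """Solve mean = (1 - f) * full + f * skip for the Skip-mode latency.

    Assumption: every non-skipped step costs exactly full_ms.
    """
    return (mean_ms - (1.0 - skip_fraction) * full_ms) / skip_fraction

reduction = latency_reduction(244.0, 128.0)            # about 0.475, reported as 48%
skip_ms = implied_skip_latency(244.0, 128.0, 0.72)     # rough estimate, not a reported number
```

Under those assumptions the implied Skip-mode cost comes out near 83 ms per step; the summary does not report this value, so it should be read as a consistency check rather than a measured result.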


This review was generated by AI and reviewed by human editors.