QUICK REVIEW

[Paper Digest] ST-VLA: Enabling 4D-Aware Spatiotemporal Understanding for General Robot Manipulation

You Wu, Zixuan Chen|arXiv (Cornell University)|Mar 14, 2026

Multimodal Machine Learning Applications | Cited by 0

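One-Sentence Summary

ST-VLA introduces a unified 3D-4D representation and the large-scale ST-Human dataset to train a VLM capable of high-level spatiotemporal reasoning (ST-VLM), which guides a low-level 3D policy to achieve strong zero-shot and long-horizon manipulation in open-world settings.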

ABSTRACT

Robotic manipulation in open-world environments requires reasoning across semantics, geometry, and long-horizon action dynamics. Existing hierarchical Vision-Language-Action (VLA) frameworks typically use 2D representations to connect high-level reasoning with low-level control, but lack depth awareness and temporal consistency, limiting robustness in complex 3D scenes. We propose ST-VLA, a hierarchical VLA framework using a unified 3D-4D representation to bridge perception and action. ST-VLA converts 2D guidance into 3D trajectories and generates smooth spatial masks that capture 4D spatio-temporal context, providing a stable interface between semantic reasoning and continuous control. To enable effective learning of such representations, we introduce ST-Human, a large-scale human manipulation dataset with 14 tasks and 300k episodes, annotated with 2D, 3D, and 4D supervision via a semi-automated pipeline. Using ST-Human, we train ST-VLM, a spatio-temporal vision-language model that generates spatially grounded and temporally coherent 3D representations to guide policy execution. The smooth spatial masks focus on task-relevant geometry and stabilize latent representations, enabling online replanning and long-horizon reasoning. Experiments on RLBench and real-world manipulation tasks show that ST-VLA significantly outperforms state-of-the-art baselines, improving zero-shot success rates by 44.6% and 30.3%. These results demonstrate that offloading spatio-temporal reasoning to VLMs with unified 3D-4D representations substantially improves robustness and generalization for open-world robotic manipulation. Project website: https://oucx117.github.io/ST-VLA/.

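Motivation and Objectives

Bridge semantic reasoning and geometric execution with a unified 3D-4D intermediate representation.
Develop a high-capacity spatio-temporal vision-language model (ST-VLM), trained on ST-Human, for 3D-4D grounding.
Enable online replanning and long-horizon manipulation through a hierarchical Vision-Language-Action framework.
Demonstrate robustness and generalization on simulated and real-world robot manipulation tasks.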

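Proposed Method

Introduce ST-VLA, a hierarchical VLA framework built on a 3D-4D representation composed of 3D trajectories and smooth spatial masks.
Create ST-Human, a large-scale 3D-4D human manipulation dataset with 300k episodes and 4.3M samples, for multi-task fine-tuning.
Fine-tune the 4B ST-VLM model on ST-Human and public datasets to ground 2D trajectories into 3D-4D representations and support long-horizon reasoning.
Use two-stage inference: the high-level ST-VLM outputs 3D-4D guidance, which conditions a low-level 3D-aware policy (3DDA/3DFA) through augmented observations.
Propose a smooth mask mechanism that suppresses task-irrelevant regions during execution and keeps latent representations stable.
Evaluate ST-VLM and ST-VLA on RLBench, RoboRefit, CVBench, SAT, and real-world Panda manipulation, comparing zero-shot generalization and long-horizon performance.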
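
To make the "2D guidance to 3D trajectory" step more concrete, here is a minimal sketch of one common way to lift 2D waypoints into 3D. It is not taken from the paper: it assumes a standard pinhole camera model with known intrinsics and a metric depth value per waypoint, and the function name `backproject_trajectory` and all numeric values are hypothetical.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def backproject_trajectory(
    pixels: Sequence[Point2D],
    depths: Sequence[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift 2D pixel waypoints (u, v) with depth z into camera-frame 3D points.

    Assumption: pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy). How ST-VLA performs this step is not stated in the digest.
    """
    if len(pixels) != len(depths):
        raise ValueError("pixels and depths must have the same length")
    points: List[Point3D] = []
    for (u, v), z in zip(pixels, depths):
        if z <= 0:
            raise ValueError("depth must be positive")
        # Invert the pinhole projection u = fx * x / z + cx, v = fy * y / z + cy.
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points.append((x, y, z))
    return points


# Hypothetical intrinsics and waypoints, for illustration only.
traj_3d = backproject_trajectory(
    pixels=[(320.0, 240.0), (420.0, 240.0)],
    depths=[1.0, 2.0],
    fx=500.0,
    fy=500.0,
    cx=320.0,
    cy=240.0,
)
```

A waypoint at the principal point maps onto the optical axis, while a waypoint 100 pixels to its right at twice the depth lands 0.4 units along the camera x axis. The same 2D offset corresponds to different 3D positions at different depths, which is the geometric ambiguity a 2D-only interface cannot resolve.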

Figure 1 : ST-VLM bridges the semantic-physical gap via unified 3D-4D spatio-temporal representations. (Left) Existing 2D-based VLMs face geometric ambiguity and temporal inconsistency due to the semantic-physical mismatch. (Right) Our ST-VLA utilizes unified 3D-4D representations with explicit traj
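
The smooth spatial mask from the method list can be illustrated with a small sketch. The digest does not specify the mask's functional form, so the Gaussian falloff around trajectory waypoints used here is purely an assumption, and the names `smooth_mask`, `apply_mask`, and `sigma` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def smooth_mask(
    height: int,
    width: int,
    centers: Sequence[Tuple[float, float]],
    sigma: float,
) -> List[List[float]]:
    """Return an H x W soft mask in [0, 1] that is 1 at a center and decays smoothly.

    Assumption: Gaussian falloff around each (u, v) waypoint center.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    mask = [[0.0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            # Keep the strongest response over all waypoint centers.
            best = 0.0
            for cu, cv in centers:
                d2 = (u - cu) ** 2 + (v - cv) ** 2
                best = max(best, math.exp(-d2 / (2.0 * sigma ** 2)))
            mask[v][u] = best
    return mask


def apply_mask(
    image: List[List[float]], mask: List[List[float]]
) -> List[List[float]]:
    """Attenuate a single-channel image element-wise by the soft mask."""
    return [
        [pixel * weight for pixel, weight in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Hypothetical 5 x 5 example with a single waypoint at the image center.
mask = smooth_mask(height=5, width=5, centers=[(2.0, 2.0)], sigma=1.0)
masked = apply_mask([[1.0] * 5 for _ in range(5)], mask)
```

Unlike a hard binary mask, the soft weights change gradually as the waypoint moves between frames, which is one plausible reading of why a smooth mask would help keep downstream latent features stable while still suppressing task-irrelevant regions.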

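Experimental Results

Research Questions

RQ1: Does a unified 3D-4D intermediate representation improve the alignment between semantic reasoning and 3D robot execution?
RQ2: Can ST-VLM, trained on ST-Human, give low-level policies robust zero-shot, long-horizon manipulation capability?
RQ3: How do 3D-4D grounding priors affect generalization to unseen objects and cluttered environments in the open world?
RQ4: Compared with 2D-based baselines, how much does ST-VLA improve zero-shot success rate, stability, and cross-scene generalization?
RQ5: On real-world robots, can a 4D-aware hierarchical framework feasibly support online replanning?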

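Key Findings

ST-VLM improves by up to 33.19% on the RoboRefit, CVBench, and SAT datasets.
On RLBench, ST-VLA raises the zero-shot success rate by 44.6%.
Real-world experiments show an average 30.3% gain in zero-shot generalization over baselines and a 40.8% gain in robustness to distractors.
ST-VLM reaches 46.67% accuracy on depth estimation and 98.00% on ST-Human-Spatial grounding, indicating strong 3D-4D grounding capability.
ST-VLA achieves highly stable long-horizon sequential manipulation, with an overall success rate of 97.3% on unseen long-horizon sequences (ST-VLA(3DFA)).
ST-VLM (4B) transfers well to unseen ST-Human-Planning tasks, with a success rate of 92.00%.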

Figure 2 : Overview of the ST-Human Dataset Construction and Unified 2D-3D-4D Task Generation.

This digest was generated by AI and reviewed by human editors.