QUICK REVIEW

[Paper Review] ST-VLA: Enabling 4D-Aware Spatiotemporal Understanding for General Robot Manipulation

You Wu, Zixuan Chen | arXiv (Cornell University) | March 14, 2026

Multimodal Machine Learning Applications | Citations: 0

One-Line Summary

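ST-VLA introduces a unified 3D-4D representation and the large-scale ST-Human dataset to train ST-VLM, a vision-language model for high-level spatio-temporal reasoning. ST-VLM guides a low-level 3D policy, yielding robust zero-shot and long-horizon manipulation in open-world environments.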

ABSTRACT

Robotic manipulation in open-world environments requires reasoning across semantics, geometry, and long-horizon action dynamics. Existing hierarchical Vision-Language-Action (VLA) frameworks typically use 2D representations to connect high-level reasoning with low-level control, but lack depth awareness and temporal consistency, limiting robustness in complex 3D scenes. We propose ST-VLA, a hierarchical VLA framework using a unified 3D-4D representation to bridge perception and action. ST-VLA converts 2D guidance into 3D trajectories and generates smooth spatial masks that capture 4D spatio-temporal context, providing a stable interface between semantic reasoning and continuous control. To enable effective learning of such representations, we introduce ST-Human, a large-scale human manipulation dataset with 14 tasks and 300k episodes, annotated with 2D, 3D, and 4D supervision via a semi-automated pipeline. Using ST-Human, we train ST-VLM, a spatio-temporal vision-language model that generates spatially grounded and temporally coherent 3D representations to guide policy execution. The smooth spatial masks focus on task-relevant geometry and stabilize latent representations, enabling online replanning and long-horizon reasoning. Experiments on RLBench and real-world manipulation tasks show that ST-VLA significantly outperforms state-of-the-art baselines, improving zero-shot success rates by 44.6% and 30.3%. These results demonstrate that offloading spatio-temporal reasoning to VLMs with unified 3D-4D representations substantially improves robustness and generalization for open-world robotic manipulation. Project website: https://oucx117.github.io/ST-VLA/.

Research Motivation and Goals

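Bridge semantic reasoning and geometric execution with a unified 3D-4D intermediate representation.
Develop a high-capacity spatio-temporal vision-language model (ST-VLM), trained on ST-Human, for 3D-4D grounding.
Enable online replanning and long-horizon manipulation through a hierarchical Vision-Language-Action framework.
Demonstrate robustness and generalization on simulated and real-world robot manipulation tasks.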

Proposed Method

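Introduce ST-VLA, built on a unified 3D-4D representation that consists of 3D trajectories and smooth spatial masks.
Build ST-Human, a large-scale 3D-4D human manipulation dataset with 300k episodes and 4.3M samples, for multi-task fine-tuning.
Fine-tune a 4B ST-VLM on ST-Human and public datasets to ground 2D trajectories into 3D-4D representations and to enable long-horizon reasoning.
Use two-stage inference: the high-level ST-VLM outputs 3D-4D guidance, which conditions a low-level 3D-aware policy (3DDA/3DFA) through augmented observations.
Propose a smooth masking mechanism that suppresses task-irrelevant regions and keeps latent representations stable during execution.
Evaluate ST-VLM and ST-VLA on RLBench, RoboRefit, CVBench, SAT, and real-world manipulation with a Panda robot, comparing zero-shot generalization and long-horizon performance against 2D-based baselines.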
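
To make the idea of turning 2D guidance into 3D trajectories and smooth spatial masks more concrete, here is a minimal sketch. It is not the authors' implementation: the review does not describe how ST-VLA performs the lifting or builds its masks. The sketch assumes a standard pinhole camera model with known intrinsics (fx, fy, cx, cy) and a metric depth value per waypoint, and it approximates a "smooth mask" as a Gaussian falloff around the trajectory pixels. The function names lift_to_3d and smooth_mask and the sigma parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def lift_to_3d(traj_2d: Sequence[Point2D], depths: Sequence[float],
               fx: float, fy: float, cx: float, cy: float) -> List[Point3D]:
    """Back-project pixel waypoints (u, v) with depth z into camera-frame 3D points.

    Assumption: a plain pinhole model; the paper's actual lifting step may differ.
    """
    if len(traj_2d) != len(depths):
        raise ValueError("need exactly one depth value per 2D waypoint")
    points = []
    for (u, v), z in zip(traj_2d, depths):
        points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def smooth_mask(height: int, width: int, traj_2d: Sequence[Point2D],
                sigma: float) -> List[List[float]]:
    """Soft mask in (0, 1]: Gaussian falloff with distance to the nearest waypoint.

    Assumption: an illustrative stand-in for a smooth spatial mask, chosen so
    that weights decay gradually instead of switching between 0 and 1.
    """
    if not traj_2d:
        raise ValueError("trajectory must contain at least one waypoint")
    mask = []
    for row in range(height):
        line = []
        for col in range(width):
            # Squared pixel distance to the closest trajectory waypoint.
            d2 = min((col - u) ** 2 + (row - v) ** 2 for u, v in traj_2d)
            line.append(math.exp(-d2 / (2.0 * sigma ** 2)))
        mask.append(line)
    return mask


if __name__ == "__main__":
    waypoints = [(320.0, 240.0), (420.0, 240.0)]
    print(lift_to_3d(waypoints, [1.0, 2.0], fx=500.0, fy=500.0, cx=320.0, cy=240.0))
    print(smooth_mask(5, 5, [(2.0, 2.0)], sigma=1.0)[2])
```

A soft mask like this keeps pixels near the planned path at full weight and fades the rest gradually, which is one plausible way to suppress task-irrelevant regions without introducing hard edges into the policy's observations.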

Figure 1 : ST-VLM bridges the semantic-physical gap via unified 3D-4D spatio-temporal representations. (Left) Existing 2D-based VLMs face geometric ambiguity and temporal inconsistency due to the semantic-physical mismatch. (Right) Our ST-VLA utilizes unified 3D-4D representations with explicit traj

Experimental Results

Research Questions

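RQ1: Can a unified 3D-4D intermediate representation improve the alignment between semantic reasoning and 3D robot execution?
RQ2: Does a large-scale ST-VLM trained on ST-Human give low-level policies robust zero-shot, long-horizon manipulation capabilities?
RQ3: How do 3D-4D grounded priors affect generalization to unseen objects and cluttered environments in open-world manipulation?
RQ4: What gains does ST-VLA deliver over 2D-based baselines in zero-shot success, stability, and cross-scenario generalization?
RQ5: Does a 4D-aware hierarchical framework make online replanning possible in real-world robotics?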

Key Results

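ST-VLM improves on prior methods by up to 33.19% on the RoboRefit, CVBench, and SAT datasets.
On RLBench, ST-VLA improves the zero-shot success rate by 44.6%.
Real-world experiments show an average improvement of 30.3% in zero-shot generalization and 40.8% in distractor robustness.
ST-VLM reaches 46.67% depth-estimation accuracy and 98.00% on ST-Human-Spatial grounding, indicating that 3D-4D grounding is feasible.
ST-VLA enables long-horizon sequential manipulation with high stability, achieving an overall success rate of 97.3% on unseen long-horizon sequences (ST-VLA (3DFA)).
ST-VLM (4B) transfers strongly to unseen ST-Human-Planning tasks, achieving a 92.00% success rate.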

Figure 2 : Overview of the ST-Human Dataset Construction and Unified 2D-3D-4D Task Generation.

This review was generated by AI and checked by a human editor.