QUICK REVIEW

[Paper Review] FUTURE-VLA: Forecasting Unified Trajectories Under Real-time Execution

Jingjing Fan, Yushan Liu|arXiv (Cornell University)|Feb 5, 2026

Multimodal Machine Learning Applications | Citations: 0

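One-Sentence Summary

FUTURE-VLA presents a unified vision-language-action architecture that compresses long-horizon spatiotemporal context and autoregressively predicts action chunks and future visuals, achieving real-time execution together with a human-in-the-loop safety gate.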

ABSTRACT

General vision-language models increasingly support unified spatiotemporal reasoning over long video streams, yet deploying such capabilities on robots remains constrained by the prohibitive latency of processing long-horizon histories and generating high-dimensional future predictions. To bridge this gap, we present FUTURE-VLA, a unified architecture that reformulates long-horizon control and future forecasting as a monolithic sequence-generation task. Adopting a dual-sided efficiency paradigm, FUTURE-VLA leverages a temporally adaptive compression strategy to maximize spatiotemporal information density, enabling the ingestion of extensive multi-view histories while maintaining constant inference latency. Simultaneously, it performs latent-space autoregression to align actionable dynamics with reviewable visual look-aheads in a single forward pass. These real-time predictive capabilities further enable a prediction-guided Human-In-the-Loop mechanism via interactive execution gating, allowing operators to dynamically validate behaviors based on interpretable future previews. Extensive evaluations demonstrate that FUTURE-VLA establishes new state-of-the-art performance, attaining success rates of 99.2% on LIBERO, 75.4% on RoboTwin, and 78.0% on a real-world Piper platform, all with a $16\times$ extended spatiotemporal window while maintaining the inference latency of a single-frame baseline.
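The execution gating described in the abstract can be pictured as a propose, review, execute loop. What follows is a minimal sketch under stated assumptions, not the authors' implementation: `gated_step`, its callbacks `propose`, `approve`, and `execute`, and the `max_resamples` limit are all hypothetical names, and the string previews stand in for decoded future frames. Resampling after a rejection mirrors the resampling-based recovery listed in this review's method summary.

```python
def gated_step(propose, approve, execute, max_resamples=3):
    """Run one gated control step (illustrative only).

    propose() -> (action_chunk, preview)   hypothetical policy call
    approve(preview) -> bool               hypothetical operator decision
    execute(action_chunk)                  hypothetical robot interface
    Returns the number of proposals made, or None if every one was rejected.
    """
    for attempt in range(1, max_resamples + 1):
        action_chunk, preview = propose()
        if approve(preview):
            execute(action_chunk)
            return attempt
    return None  # nothing was executed; the caller decides how to recover


# Stubbed example: the first preview is rejected, the resampled one is accepted.
proposals = iter([([0.9], "collision"), ([0.1], "safe")])
executed = []
attempts = gated_step(lambda: next(proposals),
                      lambda preview: preview == "safe",
                      executed.append)
```

In this sketch the action chunk is only sent to `execute` after its preview has been approved, so a rejected prediction never reaches the robot.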

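Motivation and Objectives

Motivate the need for efficient long-horizon perception and decision-making in robot vision-language models.
Develop a dual-sided efficiency framework that combines spatiotemporal compression with latent-space autoregression.
Enable prediction-guided human-in-the-loop execution to improve safety in long-horizon manipulation.
Demonstrate state-of-the-art performance on LIBERO, RoboTwin, and a real-world Piper platform while maintaining single-frame latency.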

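Proposed Method

Encode multi-view history with a frozen DINOv3 encoder to obtain robust visual features, maximizing information density under a fixed token budget.
Adopt a unified token space by mapping discrete action codes and visual latent codes into a vocabulary shared with the Qwen3-VL backbone.
Apply FAST spectral tokenization, which encodes action chunks in the frequency domain, to make trajectory representations compact.
Represent future visuals as autoregressive 1D latent prediction, using a compact 1D tokenizer (32 tokens per frame) with a learned codebook.
Implement a prediction-guided Human-In-the-Loop mechanism, including dynamic execution gating and resampling-based recovery, to ensure safety.
Keep real-time inference latency on par with a single-frame baseline while extending the spatiotemporal look-ahead by 16×.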
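To make the frequency-domain encoding of action chunks more concrete, here is a simplified sketch of the general idea: transform a chunk with a discrete cosine transform (DCT), then round the coefficients to integers. It is an illustration only. The function names, the quantization `scale`, and the sine-wave trajectory are assumptions made for this example, and any further compression stage applied to the integer codes is omitted.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                for i, x in enumerate(signal))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    out = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1.0 / n)
        s += sum(math.sqrt(2.0 / n) * c * math.cos(math.pi * (i + 0.5) * k / n)
                 for k, c in enumerate(coeffs) if k > 0)
        out.append(s)
    return out


def encode_chunk(chunk, scale=10.0):
    """Map an action chunk to integer codes by rounding scaled DCT coefficients."""
    return [round(c * scale) for c in dct(chunk)]


def decode_chunk(tokens, scale=10.0):
    """Approximately recover the action chunk from integer codes."""
    return idct([t / scale for t in tokens])


# Hypothetical smooth single-joint trajectory of 16 steps.
chunk = [math.sin(0.2 * t) for t in range(16)]
tokens = encode_chunk(chunk)
recon = decode_chunk(tokens)
max_err = max(abs(a - b) for a, b in zip(chunk, recon))
```

Because the transform is orthonormal, the reconstruction error is bounded by the rounding error on the coefficients, so a larger `scale` trades more distinct integer codes for higher fidelity.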

Figure 1 : Comparison of VLA-WM Architectures. (a) Modular Fragmentation: Independent VLA and World Model operating with decoupled representations. (b) Instantaneous Unification: A unified framework integrating perception and prediction within a short-horizon temporal window. (c) FUTURE-VLA (Ours):

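Experimental Results

Research Questions

RQ1: Can long-horizon context be compressed effectively without sacrificing manipulation fidelity?
RQ2: Can latent-space autoregression generate action chunks and future previews simultaneously in real time?
RQ3: Does prediction-guided human-in-the-loop execution improve safety and task success rates in long-horizon robot tasks?
RQ4: How does FUTURE-VLA perform against prior VLA/WM approaches on diverse benchmarks (LIBERO, RoboTwin) and on a real-world platform?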

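Key Findings

Achieves state-of-the-art success rates of 99.2% on LIBERO (with HIL), 75.4% on RoboTwin, and 78.0% on the real-world Piper platform.
Extends the bidirectional spatiotemporal window by 16× while maintaining single-frame inference latency.
Shows strong manipulation performance, including on high-precision tasks (e.g., Stack Bowls 94% and Pick Dual Bottles 92% on RoboTwin, with HIL).
Adaptive temporal compression keeps recent observations at high resolution while compressing distant history, enabling long-horizon perception.
1D visual tokenization (32 tokens per frame) preserves reconstruction quality and enables reliable future rollouts.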
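One way to picture compressing distant history harder than recent frames under a fixed token budget is the sketch below. It is not the paper's compression scheme: the geometric `decay` schedule, the 256-token budget, the average pooling, and the helper names `allocate_tokens` and `pool` are all assumptions chosen for illustration.

```python
def allocate_tokens(num_frames, budget, decay=0.5, min_tokens=1):
    """Split a fixed token budget across frames (oldest first, newest last).

    Older frames receive geometrically fewer tokens (hypothetical schedule).
    """
    if budget < num_frames * min_tokens:
        raise ValueError("budget too small for the requested history length")
    weights = [decay ** age for age in range(num_frames - 1, -1, -1)]
    total = sum(weights)
    spare = budget - num_frames * min_tokens
    counts = [min_tokens + int(spare * w / total) for w in weights]
    counts[-1] += budget - sum(counts)  # rounding remainder goes to the newest frame
    return counts


def pool(tokens, k):
    """Average-pool a list of scalar features into k groups (requires k <= len(tokens))."""
    n = len(tokens)
    groups = [tokens[i * n // k:(i + 1) * n // k] for i in range(k)]
    return [sum(g) / len(g) for g in groups]


# A 16-frame history and a single frame consume the same total budget.
counts = allocate_tokens(num_frames=16, budget=256)
single = allocate_tokens(num_frames=1, budget=256)
compressed = pool([1.0, 2.0, 3.0, 4.0], 2)
```

The total token count stays the same whether the history holds 1 frame or 16, which is the property that lets a longer window fit a fixed sequence length in this toy setting.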

Figure 2 : The architecture of FUTURE-VLA. On the input side, multi-view historical observations are encoded via a frozen DINOv3 encoder and processed through Temporally Adaptive Cascaded Compression to maximize information density under a fixed token budget. On the output side, the model autoregres


This review was generated by AI and checked by a human editor.