QUICK REVIEW

[Paper Review] Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving

Linhan Wang, Yang, Zichong | arXiv (Cornell University) | Jan 29, 2026

Autonomous Vehicle Technology and Safety | Citations: 0

One-Sentence Summary

Drive-JEPA combines Video Joint-Embedding Predictive Architecture (V-JEPA) pretraining with multimodal trajectory distillation to enable end-to-end driving, achieving state-of-the-art results on NAVSIM and competitive performance on Bench2Drive without heavy perception annotations.

ABSTRACT

End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state-of-the-art.

Research Motivation and Objectives

Use scalable self-supervised video pretraining to learn planning representations for end-to-end autonomous driving.
Learn multimodal driving behavior even though each scene provides only a single human trajectory, by distilling simulator-generated trajectories.
Pair proposal-centric trajectory generation with momentum-aware selection in a lightweight planner for stable, safe decisions.

Proposed Method

Pretrain a ViT encoder on large-scale driving videos using the V-JEPA objective to learn planning-aligned representations.
Generate online waypoint-anchored trajectory proposals with deformable attention and BEV feature sampling.
Distill multimodal trajectories from a simulator by building a large trajectory vocabulary and selecting high-quality pseudo-teachers for supervision.
Train a momentum-aware trajectory scorer that blends safety, comfort, and cross-frame stability to select the final trajectory.
Incorporate lightweight auxiliary tasks (proposal-centric mapping and collision prediction) to enrich spatiotemporal understanding without heavy computation.
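The distillation step above can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: the function names, the score threshold `min_score`, the cap `max_teachers`, the L1 imitation loss, and the winner-takes-all matching are all hypothetical choices made here for clarity. It shows one plausible way to pick high-scoring simulator trajectories from a vocabulary as pseudo-teachers and to supervise a set of proposals with them alongside the human trajectory.

```python
from typing import List, Sequence, Tuple

Trajectory = List[Tuple[float, float]]  # (x, y) waypoints in the ego frame


def select_pseudo_teachers(
    vocabulary: Sequence[Trajectory],
    sim_scores: Sequence[float],
    min_score: float = 0.9,  # hypothetical quality threshold
    max_teachers: int = 3,   # hypothetical cap on teachers per scene
) -> List[int]:
    """Return indices of the best-scoring vocabulary trajectories."""
    if len(vocabulary) != len(sim_scores):
        raise ValueError("one simulator score is required per trajectory")
    good = [i for i, s in enumerate(sim_scores) if s >= min_score]
    good.sort(key=lambda i: sim_scores[i], reverse=True)
    return good[:max_teachers]


def imitation_loss(pred: Trajectory, target: Trajectory) -> float:
    """Mean L1 distance between matching waypoints (assumed loss)."""
    total = sum(
        abs(px - tx) + abs(py - ty)
        for (px, py), (tx, ty) in zip(pred, target)
    )
    return total / len(target)


def distillation_loss(
    proposals: Sequence[Trajectory],
    human: Trajectory,
    teachers: Sequence[Trajectory],
    teacher_weight: float = 0.5,  # hypothetical weighting
) -> float:
    """Human term plus a weighted pseudo-teacher term.

    Each target is matched to its closest proposal (winner-takes-all),
    so different proposals can specialize on different driving modes.
    """
    human_term = min(imitation_loss(p, human) for p in proposals)
    if not teachers:
        return human_term
    teacher_term = sum(
        min(imitation_loss(p, t) for p in proposals) for t in teachers
    ) / len(teachers)
    return human_term + teacher_weight * teacher_term
```

With a single proposal that matches the human trajectory but not the teacher, only the teacher term contributes to the loss, which is what pushes additional proposals toward the simulator-validated alternatives.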

Figure 1 : Comparison between end-to-end planners on both perception-free and perception-based settings.

Experimental Results

Research Questions

RQ1: Can V-JEPA-based video pretraining improve planning representations for end-to-end driving beyond perception-based learning?
RQ2: Does multimodal trajectory distillation from simulators provide diverse, safe supervision that surpasses single-human-trajectory guidance?
RQ3: Does momentum-aware trajectory selection improve temporal stability and driving comfort in an online planning loop?
RQ4: What gains can be achieved in perception-free and perception-based settings on NAVSIM and Bench2Drive with Drive-JEPA?
RQ5: How does Drive-JEPA perform with varying backbone choices (ResNet34 vs ViT) and input modalities?

Key Findings

Drive-JEPA sets a new state of the art on NAVSIM, reaching 93.3 PDMS on v1 and 87.8 EPDMS on v2.
Perception-free evaluation with a lightweight decoder and V-JEPA pretraining matches or exceeds several perception-based methods.
Multimodal Trajectory Distillation increases proposal diversity and, when combined with momentum-aware selection, improves driving comfort and temporal stability.
On Bench2Drive, Drive-JEPA attains top Driving Score and competitive Efficiency, illustrating strong closed-loop performance.
Ablations show V-JEPA pretraining and multimodal supervision jointly contribute to improvements, with momentum-aware selection further boosting comfort metrics.
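To make the momentum-aware selection idea concrete, here is a minimal sketch. It assumes a simple blend of each proposal's own score with its closeness to the trajectory chosen at the previous frame; the blend weight `momentum`, the exponential consistency term, and the distance `scale` are hypothetical and are not taken from the paper. The previous trajectory is assumed to be already expressed in the current ego frame and time-aligned with the proposals.

```python
import math
from typing import List, Optional, Sequence, Tuple

Trajectory = List[Tuple[float, float]]  # (x, y) waypoints in the ego frame


def mean_waypoint_distance(a: Trajectory, b: Trajectory) -> float:
    """Average Euclidean distance between matching waypoints."""
    return sum(
        math.hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(a, b)
    ) / len(a)


def select_trajectory(
    proposals: Sequence[Trajectory],
    scores: Sequence[float],
    previous: Optional[Trajectory] = None,
    momentum: float = 0.3,  # hypothetical blend weight in [0, 1]
    scale: float = 1.0,     # hypothetical distance scale in meters
) -> int:
    """Return the index of the proposal with the best blended value."""
    best_index, best_value = 0, -math.inf
    for i, (trajectory, score) in enumerate(zip(proposals, scores)):
        value = score
        if previous is not None:
            # Consistency is 1.0 for an identical trajectory and decays
            # toward 0.0 as the proposal departs from the previous choice.
            distance = mean_waypoint_distance(trajectory, previous)
            consistency = math.exp(-distance / scale)
            value = (1.0 - momentum) * score + momentum * consistency
        if value > best_value:
            best_index, best_value = i, value
    return best_index
```

In this sketch a proposal with a marginally higher score does not displace the previously selected plan unless the score gap outweighs the consistency term, which is one simple way to damp frame-to-frame flicker between near-equal candidates.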

Figure 2 : Overview of the Drive-JEPA architecture. Driving Video Pretraining learns a ViT encoder from large-scale driving videos using the self-supervised V-JEPA objective. Given the pretrained features, Waypoint-anchored Proposal Generation efficiently produces multiple trajectory proposals, whos

This review was generated by AI and checked by a human editor.