QUICK REVIEW

[Paper Review] Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos

Matthew Strong, Wei-Jer Chang | arXiv (Cornell University) | February 25, 2026

Advanced Vision and Imaging | Citations: 0

One-Line Summary

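LFG pretrains a geometry-, motion-, and semantics-aware model on unlabeled, unposed egocentric YouTube videos, then fine-tunes it for planning with a single front camera, achieving strong planning performance and data efficiency.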

ABSTRACT

Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic segmentation, and motion masks. Multi-modal teachers provide sequence-level pseudo-supervision, enabling LFG to learn a unified pseudo-4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi-camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and qualitative motion prediction tasks. These geometry and motion-aware features position LFG as a compelling video-centric foundation model for autonomous driving.

Research Motivation and Goals

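Learn robust driving representations from large-scale, unannotated egocentric videos that come without poses or labels.
Develop a label-free, teacher-guided pretraining framework that predicts future geometry, semantics, and motion.
Add a lightweight autoregressive extension to a feedforward 3D reconstruction backbone to enable short-horizon prediction.
Use multi-modal teacher signals to supervise geometry, semantics, and motion without explicit labels.
Demonstrate strong transfer to planning and other downstream tasks with data-efficient fine-tuning.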

Proposed Method

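Uses a pretrained encoder (pi3) and a causal autoregressive transformer to predict future geometry, semantics, and motion from unposed video.
Uses multi-modal teachers (SegFormer for semantics; SAM2 and CoTracker3 for motion) to provide pseudo-labels for unlabeled data.
Trains the model to predict a unified pseudo-4D representation comprising point maps, camera poses, semantic maps, confidence maps, and motion masks for both observed and future frames.
Introduces a semantic head, trained on SegFormer pseudo-labels, that produces semantics for future frames.
Builds pseudo-ground-truth motion masks by tracking instances from the first frame and back-projecting the teachers' 3D motion estimates, which enables supervised motion prediction.
Optimizes a composite loss combining segmentation, pose, point-map, confidence, and motion terms, with extra weight on future frames to encourage extrapolation.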
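
The composite loss in the last item can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `composite_loss`, the `TERM_WEIGHTS` table, and the `future_weight` value are all hypothetical, since the review does not state the actual weights or how the terms are normalized. It assumes each frame already has one scalar loss per term.

```python
from typing import Dict, List

# Hypothetical per-term weights; the review does not give the real values.
TERM_WEIGHTS: Dict[str, float] = {
    "segmentation": 1.0,
    "pose": 1.0,
    "point_map": 1.0,
    "confidence": 1.0,
    "motion": 1.0,
}


def composite_loss(
    per_frame_losses: List[Dict[str, float]],
    num_observed: int,
    future_weight: float = 2.0,  # assumed value, for illustration only
) -> float:
    """Weighted mean over frames of the weighted sum of per-term losses.

    Frames with index >= num_observed are treated as future frames and
    receive `future_weight` instead of 1.0.
    """
    if not per_frame_losses:
        raise ValueError("per_frame_losses must not be empty")

    total = 0.0
    weight_sum = 0.0
    for t, terms in enumerate(per_frame_losses):
        frame_weight = future_weight if t >= num_observed else 1.0
        frame_loss = sum(TERM_WEIGHTS[name] * value for name, value in terms.items())
        total += frame_weight * frame_loss
        weight_sum += frame_weight
    return total / weight_sum


# One observed frame followed by one future frame.
frames = [
    {"segmentation": 1.0, "pose": 1.0},
    {"segmentation": 2.0, "pose": 2.0},
]
loss = composite_loss(frames, num_observed=1, future_weight=2.0)
```

Normalizing by the sum of frame weights keeps the loss scale comparable when `future_weight` changes; whether the original training does this is not stated in the review.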

Experimental Results

Research Questions

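RQ1: Can geometry-, motion-, and semantics-aware representations be learned from large-scale, unannotated egocentric driving videos that have no poses or labels?
RQ2: How well does label-free, teacher-guided pretraining transfer to downstream autonomous driving planning with minimal labeled data?
RQ3: Does the short-horizon autoregressive extension capture the dynamic scene structure needed for planning with a single-camera system?
RQ4: How does the data efficiency of the learned encoder for planning compare with BEV-based and multi-sensor baselines?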

Key Results

Method	Input	NC	DAC	TTC	C.	EP	PDMS
UniAD	6Cam	98.2	93.7	94.4	100.0	79.1	85.2
TransFuser	3Cam+L	97.7	92.8	92.0	100.0	79.2	84.0
Hydra-MDP	3Cam+L	96.9	94.0	94.0	100.0	78.7	84.7
DiffusionDrive	3Cam+L	96.8	95.4	94.7	100.0	82.0	88.1
LFG (Ours)	1Cam	98.2	93.7	94.4	100.0	79.1	85.2
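
For readers unfamiliar with the PDMS column, the sketch below shows how a per-scenario PDM score is commonly described for NAVSIM: the two penalty terms (NC, DAC) multiply a weighted average of EP, TTC, and comfort (C.) with weights 5, 5, and 2. This is an assumption based on the benchmark's published definition, not something stated in this review, and the function names are hypothetical. Because the benchmark averages the score over scenarios, plugging the column averages from the table into the formula will generally not reproduce the PDMS column.

```python
from typing import Dict, List


def pdm_score(nc: float, dac: float, ttc: float, comfort: float, ep: float) -> float:
    """Per-scenario PDM score; every sub-score is expected to lie in [0, 1].

    Assumed form: NC * DAC * (5*EP + 5*TTC + 2*C) / 12.
    """
    weighted_average = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0
    return nc * dac * weighted_average


def mean_pdms(scenarios: List[Dict[str, float]]) -> float:
    """Average the per-scenario scores and report them on a 0-100 scale."""
    if not scenarios:
        raise ValueError("scenarios must not be empty")
    return 100.0 * sum(pdm_score(**s) for s in scenarios) / len(scenarios)


# Two made-up scenarios: one flawless, one with an at-fault collision (NC = 0).
example = [
    {"nc": 1.0, "dac": 1.0, "ttc": 1.0, "comfort": 1.0, "ep": 1.0},
    {"nc": 0.0, "dac": 1.0, "ttc": 1.0, "comfort": 1.0, "ep": 1.0},
]
score = mean_pdms(example)
```

The multiplicative penalties mean a single collision or drivable-area violation zeroes the scenario's score regardless of progress or comfort.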

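LFG achieves strong planning performance on NAVSIM using only a single front camera, outperforming some multi-view and LiDAR-based baselines.
Even with only 10% of the labeled data, LFG reaches competitive planning performance, demonstrating strong data efficiency.
LFG's pretrained encoder transfers effectively not only to planning but also to semantic, geometric, and motion tasks (e.g., depth and 3D point maps).
The model predicts temporally consistent geometry and short-horizon future ego-motion, and maintains quality on future frames.
With a single front-view camera, LFG matches or outperforms BEV-based systems that use richer sensor configurations on the planning benchmark.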


This review was generated by AI and reviewed by a human editor.