QUICK REVIEW

[Paper Review] VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events

Mohammad Qazim Bhat, Yufan Huang|arXiv (Cornell University)|Mar 18, 2026
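Multimodal Machine Learning Applications|Cited by 0
One-Sentence Summary

TL;DR: A modular post-training framework (VLM-AutoDrive) adapts pretrained vision-language models to detect safety-critical driving events (Collision, Near-Collision), using multimodal supervision and chain-of-thought reasoning to achieve substantial gains over zero-shot baselines.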

ABSTRACT

The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA's Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot settings; fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. Evaluated on real-world Nexar dashcam videos, it achieves substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces, bridging the gap between perception, causality, and decision reasoning in autonomous driving.

Research Motivation and Objectives

  • Demonstrate the limitations of zero-shot VLMs for high-temporal-fidelity driving anomaly detection.
  • Propose a modular post-training framework (VLM-AutoDrive) to align VLMs with domain-specific driving tasks.
  • Build a diverse supervision pipeline (captions, VQA, MCQs, and CoT reasoning) to improve temporal sensitivity and interpretability.
  • Showcase scalability and extensibility to additional driving anomalies beyond Collision detection.

Proposed Method

  • Analyze zero-shot performance of pretrained VLMs on driving anomaly detection to identify domain gaps.
  • Fine-tune base VLMs with multimodal supervision signals (MCQs, captions, VQA, and reasoning traces) in a supervised fine-tuning (SFT) regime, optionally followed by RL.
  • Build an annotation pipeline that produces large, diverse supervision from Nexar dashcam data, including metadata-derived captions, LLM-generated descriptions, VQA pairs, and reasoning traces, to guide learning.
  • Use a sliding-window chunking strategy to create 4–6 second clips with high frame rates to capture brief events, and balance data across classes.
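
To make the supervision signals listed above more concrete, here is a minimal sketch of how a clip label could be turned into a multiple-choice (MCQ) training sample, with an optional reasoning trace. The paper's actual data schema is not given in this review, so everything here is hypothetical: the `make_mcq_sample` helper, the dictionary keys, the prompt wording, the `<think>` tag format, and the third "Normal" class.

```python
# Hypothetical label set: the review names Collision and Near-Collision;
# "Normal" is an assumed third class for clips without an event.
CLASSES = ("Collision", "Near-Collision", "Normal")
LETTERS = "ABC"


def make_mcq_sample(clip_path, label, reasoning=None):
    """Build one SFT sample as a plain dict (assumed schema, not the paper's).

    When `reasoning` is given, the target response contains the reasoning
    trace followed by the answer letter; otherwise only the answer letter.
    """
    if label not in CLASSES:
        raise ValueError(f"unknown label: {label}")
    options = "\n".join(f"{letter}. {name}" for letter, name in zip(LETTERS, CLASSES))
    prompt = "Which event occurs in this dashcam clip?\n" + options
    answer = LETTERS[CLASSES.index(label)]
    if reasoning is None:
        response = answer
    else:
        # The <think> wrapper is an assumed convention for reasoning traces.
        response = f"<think>{reasoning}</think>\n{answer}"
    return {"video": clip_path, "prompt": prompt, "response": response}


plain_sample = make_mcq_sample("clip_0001.mp4", "Near-Collision")
reasoning_sample = make_mcq_sample(
    "clip_0002.mp4",
    "Collision",
    reasoning="The lead vehicle brakes sharply and the ego vehicle makes contact.",
)
```

Emitting both a plain and a reasoning variant of the same question is one way to train the classification and reasoning modes side by side; whether the authors mix them this way is not stated here.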
Figure 1 : Sliding Window Chunking.
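
The sliding-window chunking idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the 5-second window sits inside the 4–6 second range mentioned above, but the 1-second stride, the rule that a clip takes the event label when the event timestamp falls inside it, and the "Normal" label are all assumed for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    start_s: float  # clip start time in seconds
    end_s: float    # clip end time in seconds
    label: str      # class assigned to the clip


def sliding_window_clips(video_len_s, event_time_s, event_label,
                         window_s=5.0, stride_s=1.0, normal_label="Normal"):
    """Cut a video into fixed-length, overlapping clips.

    A clip gets `event_label` when the event timestamp lies in
    [start, end); otherwise it gets `normal_label`. Both the labeling
    rule and the default window/stride values are assumptions.
    """
    if window_s <= 0 or stride_s <= 0:
        raise ValueError("window_s and stride_s must be positive")
    clips = []
    start = 0.0
    # Only keep windows that fit entirely inside the video.
    while start + window_s <= video_len_s:
        end = start + window_s
        inside = event_time_s is not None and start <= event_time_s < end
        clips.append(Clip(start, end, event_label if inside else normal_label))
        start += stride_s
    return clips


# A 10-second video with a collision annotated at 7.2 s.
clips = sliding_window_clips(video_len_s=10.0, event_time_s=7.2,
                             event_label="Collision")
event_clips = [c for c in clips if c.label == "Collision"]
```

Because the windows overlap, a single brief event yields several positive clips at different temporal offsets, which is one way such a scheme can help with rare classes. Class balancing would still need a separate sampling step.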

Experimental Results

Research Questions

  • RQ1: Can general-purpose VLMs detect safety-critical driving events in zero-shot settings, or is domain-specific adaptation required?
  • RQ2: Does multimodal, reasoning-informed supervision improve detection of collisions and near-collisions in ego-centric dashcam videos?
  • RQ3: Which data signals (captions, VQA, MCQs, CoT) most effectively align VLMs with high-temporal-fidelity driving anomalies?
  • RQ4: Is the approach scalable to additional driving anomaly classes with minimal retraining?

Key Findings

  • Zero-shot VLMs show near-zero recall for Collision in driving contexts without domain adaptation.
  • Post-training with VLM-AutoDrive substantially improves Collision detection: for Cosmos-Reason1 7B (CR1), Collision F1 rises from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%.
  • Diverse supervision signals (MCQs, captions, VQA) and reasoning traces help preserve and enhance chain-of-thought capabilities during fine-tuning, improving interpretability.
  • High temporal fidelity (30 FPS) and data balance are critical; increased frame rate and corrected class balance yield the largest gains.
  • Reasoning supervision (Reasoning MCQs and Reasoning VQA) can yield interpretable think traces and improve reasoning-mode performance without sacrificing classification accuracy.
  • The framework demonstrates scalability to incorporate additional anomaly types with minimal retraining.
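
To see why near-zero Collision recall translates into a Collision F1 of 0.00, the per-class metric can be computed from paired labels as in the sketch below. The label lists are toy data invented for illustration, not results from the paper, and returning 0.0 when precision or recall is undefined is an assumed convention.

```python
def f1_for_class(y_true, y_pred, cls):
    """Return (precision, recall, F1) for one class from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)  # true positives
    fp = sum(1 for t, p in pairs if t != cls and p == cls)  # false positives
    fn = sum(1 for t, p in pairs if t == cls and p != cls)  # false negatives
    # Assumed convention: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels only; "Normal" is an assumed class name.
y_true = ["Collision", "Collision", "Near-Collision", "Normal"]
never_collision = ["Normal", "Near-Collision", "Near-Collision", "Normal"]
some_collision = ["Collision", "Normal", "Near-Collision", "Collision"]

zero_shot_like = f1_for_class(y_true, never_collision, "Collision")
adapted_like = f1_for_class(y_true, some_collision, "Collision")
```

A model that never predicts Collision has no true positives, so recall and F1 for that class are both zero regardless of how well it handles the other classes. This is why overall accuracy alone can hide the failure.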
Figure 2 : System Diagram.


This review was generated by AI and reviewed by human editors.