QUICK REVIEW

[Paper Review] DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Xiaoyu Tian, Junru Gu|arXiv (Cornell University)|Feb 19, 2024

Multimodal Machine Learning Applications|Cited by 22

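One-Line Summary

DriveVLM uses a vision-language model with a chain-of-thought (CoT) process for scene description, scene analysis, and hierarchical planning in autonomous driving. DriveVLM-Dual combines it with a traditional pipeline for real-time spatial grounding and planning. The authors also introduce SUP-AD, a dataset and evaluation framework for scene understanding and planning.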

ABSTRACT

A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.

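Motivation and Objectives

Address the challenge of scene understanding in urban autonomous driving, including long-tail scenarios and delicate human behaviors.
Use large vision-language models to perform environment description, object analysis, and hierarchical planning.
Mitigate the limitations of VLMs in spatial grounding and speed with the hybrid DriveVLM-Dual system.
Define the Scene Understanding for Planning (SUP) task and build a dedicated SUP-AD dataset with evaluation metrics.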

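Proposed Method

Process image sequences with a large vision-language model equipped with a CoT mechanism.
Decompose the CoT into three modules: scene description, scene analysis, and hierarchical planning.
Describe the environment and identify critical objects to form the scene understanding input.
Analyze the static attributes, motion states, and behaviors of critical objects, and estimate their influence on the ego vehicle.
Output the plan in three stages: meta-actions, decision descriptions, and trajectory waypoints.
Introduce DriveVLM-Dual, which combines DriveVLM with traditional 3D perception and high-frequency planning to achieve real-time performance.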
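
The three CoT modules listed above can be pictured as a simple data flow. The following is a minimal sketch, not the authors' implementation: every class, field, and function name here is hypothetical, the three stages are passed in as plain callables, and the stub functions merely stand in for VLM calls so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# (x, y) in the ego frame; the coordinate convention is an assumption.
Waypoint = Tuple[float, float]


@dataclass
class SceneDescription:
    environment: str             # free-text description of the environment
    critical_objects: List[str]  # objects judged important for planning


@dataclass
class ObjectAnalysis:
    name: str
    static_attributes: str
    motion_state: str
    behavior: str
    influence_on_ego: str


@dataclass
class HierarchicalPlan:
    meta_actions: List[str]      # stage 1: short action labels
    decision_description: str    # stage 2: textual decision
    waypoints: List[Waypoint] = field(default_factory=list)  # stage 3


def run_cot(
    frames: Sequence[str],
    describe: Callable[[Sequence[str]], SceneDescription],
    analyze: Callable[[Sequence[str], str], ObjectAnalysis],
    plan: Callable[[SceneDescription, List[ObjectAnalysis]], HierarchicalPlan],
) -> HierarchicalPlan:
    """Chain scene description -> scene analysis -> hierarchical planning."""
    description = describe(frames)
    analyses = [analyze(frames, name) for name in description.critical_objects]
    return plan(description, analyses)


# Hypothetical stand-ins for the VLM-backed stages.
def stub_describe(frames: Sequence[str]) -> SceneDescription:
    return SceneDescription("rainy urban road", ["cyclist"])


def stub_analyze(frames: Sequence[str], name: str) -> ObjectAnalysis:
    return ObjectAnalysis(name, "two-wheeled", "moving slowly",
                          "drifting left", "may enter the ego lane")


def stub_plan(description: SceneDescription,
              analyses: List[ObjectAnalysis]) -> HierarchicalPlan:
    names = ", ".join(a.name for a in analyses)
    return HierarchicalPlan(
        meta_actions=["slow down", "yield"],
        decision_description=f"Yield to {names} before proceeding.",
        waypoints=[(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
    )
```

Calling `run_cot(["frame0", "frame1"], stub_describe, stub_analyze, stub_plan)` returns a `HierarchicalPlan` whose three fields mirror the three planning stages. Passing the stages in as callables keeps the sketch independent of any particular model API.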

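Experimental Results

Research Questions

RQ1: How can vision-language models be used to achieve robust scene understanding and planning in complex driving scenarios?
RQ2: What are the benefits and limitations of integrating VLM-based reasoning with a traditional 3D perception and planning pipeline?
RQ3: Can the dedicated SUP task and the SUP-AD dataset effectively evaluate the scene understanding and planning capabilities of driving models?
RQ4: Does the hybrid DriveVLM-Dual system improve spatial grounding and real-time trajectory planning compared with end-to-end VLMs and traditional pipelines?

Key Findings

DriveVLM performs strongly on complex, long-tail driving scenarios in the nuScenes and SUP-AD datasets.
DriveVLM-Dual, when working together with a perception module, achieves state-of-the-art planning performance on nuScenes, including notable reductions in planning error and collisions.
Ablation studies show that adding critical object analysis and 3D perception prompts improves planning accuracy and safety metrics.
DriveVLM-Dual provides real-time planning through a slow-fast architecture: DriveVLM supplies a reference trajectory, and a fast traditional planner refines it.
The SUP-AD dataset and evaluation framework make it possible to evaluate scene description, scene analysis, and meta-action planning in driving tasks.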
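
The slow-fast arrangement described in the findings can be illustrated with a toy scheduling loop. This is a sketch under stated assumptions, not the paper's system: `SlowFastPlanner`, `slow_every`, and the stub functions are hypothetical names, the loop is synchronous for simplicity (a deployed system would more plausibly run the slow branch asynchronously), and the "refinement" is a trivial lateral shift chosen only to keep the example runnable.

```python
from typing import Callable, List, Optional, Tuple

Waypoint = Tuple[float, float]
Trajectory = List[Waypoint]


class SlowFastPlanner:
    """Refresh a slow reference trajectory occasionally; refine it every tick."""

    def __init__(
        self,
        slow_planner: Callable[[float], Trajectory],
        fast_refiner: Callable[[Trajectory, float], Trajectory],
        slow_every: int,
    ) -> None:
        if slow_every < 1:
            raise ValueError("slow_every must be at least 1")
        self.slow_planner = slow_planner
        self.fast_refiner = fast_refiner
        self.slow_every = slow_every
        self._reference: Optional[Trajectory] = None
        self._tick = 0

    def step(self, observation: float) -> Trajectory:
        # Slow branch: recompute the reference only every `slow_every` ticks.
        if self._tick % self.slow_every == 0:
            self._reference = self.slow_planner(observation)
        self._tick += 1
        # Fast branch: refine the cached reference on every tick.
        return self.fast_refiner(self._reference, observation)


# Hypothetical stand-in for the slow, VLM-based planner.
def stub_slow_planner(observation: float) -> Trajectory:
    return [(float(i), 0.0) for i in range(1, 4)]


# Hypothetical stand-in for the fast planner: shift laterally by the observation.
def stub_fast_refiner(reference: Trajectory, observation: float) -> Trajectory:
    return [(x, y + observation) for x, y in reference]
```

With `slow_every=5`, ten calls to `step` invoke the slow planner twice and the fast refiner ten times, which is the point of the design: the expensive model sets the coarse plan while the cheap planner keeps the output rate high.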

This review was generated by AI and checked by a human editor.