QUICK REVIEW

[Paper Review] HiLM-D: Enhancing MLLMs with Multi-Scale High-Resolution Details for Autonomous Driving

Xinpeng Ding, Jianhua Han|arXiv (Cornell University)|Sep 11, 2023

Multimodal Machine Learning Applications|Cited by 22

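One-Sentence Summary

HiLM-D enhances multimodal LLMs for autonomous driving by incorporating multi-scale, high-resolution visual details and a specialized query detection head, enabling accurate bounding-box prediction and risk-object understanding in driving scenes.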

ABSTRACT

Recent efforts to use natural language for interpretable driving focus mainly on planning, neglecting perception tasks. In this paper, we address this gap by introducing ROLISP (Risk Object Localization and Intention and Suggestion Prediction), which towards interpretable risk object detection and suggestion for ego car motions. Accurate ROLISP implementation requires extensive reasoning to identify critical traffic objects and infer their intentions, prompting us to explore the capabilities of multimodal large language models (MLLMs). However, the limited perception performance of CLIP-ViT vision encoders in existing MLLMs struggles with capturing essential visual perception information, e.g., high-resolution, multi-scale and visual-related inductive biases, which are important for autonomous driving. Addressing these challenges, we introduce HiLM-D, a resource-efficient framework that enhances visual information processing in MLLMs for ROLISP. Our method is motivated by the fact that the primary variations in autonomous driving scenarios are the motion trajectories rather than the semantic or appearance information (e.g., the shapes and colors) of objects. Hence, the visual process of HiLM-D is a two-stream framework: (i) a temporal reasoning stream, receiving low-resolution dynamic video content, to capture temporal semantics, and (ii) a spatial perception stream, receiving a single high-resolution frame, to capture holistic visual perception-related information. The spatial perception stream can be made very lightweight by a well-designed P-Adapter, which is lightweight, training-efficient, and easily integrated into existing MLLMs. Experiments on the DRAMA-ROLISP dataset show HiLM-D's significant improvements over current MLLMs, with a 3.7% in BLEU-4 for captioning and 8.7% in mIoU for detection.

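Motivation and Objectives

Motivate high-resolution scene understanding for autonomous driving within multimodal LLMs.
Integrate video-aware spatiotemporal features into MLLMs through ST-Adapters.
Enable object detection and bounding-box reasoning within an LLM-based framework.
Investigate how different query detection heads and position representations affect detection performance.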

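Proposed Method

Introduce ST-Adapters, which use depthwise 3D convolutions to fuse video features with LLM representations.
Extend baseline MLLMs (MiniGPT-4 and its variants) with an auxiliary detector that generates bounding boxes from the LLM's hidden states.
Compare several query detection head (QDH) architectures: LLM-based regression, DETR-style, and the proposed design.
Experiment with position representations for object localization (numerical coordinates versus an additional vocabulary).
Run ablations on training with a frozen LLM versus LoRA-based fine-tuning to evaluate efficiency and performance.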
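
The two position representations in the list above can be illustrated with a small sketch. It is not the paper's implementation. It assumes boxes are normalized (x1, y1, x2, y2) values in [0, 1]; the function names, the `<loc_k>` token format, and the bin count of 100 are all hypothetical choices made for illustration.

```python
import re

NUM_BINS = 100  # hypothetical number of added location tokens


def box_to_numeric_text(box, ndigits=2):
    """Numerical-coordinate variant: write the box as ordinary digits."""
    return "[" + ", ".join(f"{v:.{ndigits}f}" for v in box) + "]"


def numeric_text_to_box(text):
    """Parse the numeric text back into a tuple of floats."""
    return tuple(float(m) for m in re.findall(r"\d+\.\d+|\d+", text))


def box_to_location_tokens(box, num_bins=NUM_BINS):
    """Extra-vocabulary variant: quantize each coordinate into an added token."""
    tokens = []
    for v in box:
        idx = min(int(v * num_bins), num_bins - 1)  # clamp v == 1.0 into the last bin
        tokens.append(f"<loc_{idx}>")
    return tokens


def location_tokens_to_box(tokens, num_bins=NUM_BINS):
    """Map each token back to the center of its bin."""
    return tuple((int(t[5:-1]) + 0.5) / num_bins for t in tokens)
```

The numeric variant reuses the LLM's existing digit tokens, while the extra-vocabulary variant requires new embeddings to be learned for every `<loc_k>` token. The sketch only shows the encoding difference and makes no claim about why one performs better.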

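Experimental Results

Research Questions

RQ1: Can multi-scale high-resolution visual details improve object localization and risk understanding in MLLMs for autonomous driving?
RQ2: How do different query detection head architectures affect bounding-box accuracy within LLMs?
RQ3: How do position representations and training strategies (LoRA versus frozen) affect detection and captioning performance?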

Key Findings

Type	Captioning AVG	Detection B4	mIoU
Vocabulary	54.7	43.2	49.0
Numerical	55.8	48.9	52.4
Ours	55.8	59.6	57.7
LoRA	-	59.6	-
Frozen	55.8	59.6	-

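Using numerical coordinates directly for bounding-box localization outperforms using an additional coordinate vocabulary.
The proposed head, which injects LLM information as a prior into cross-attention, achieves competitive or better mIoU and detection metrics than the DETR-style head.
LoRA-based fine-tuning is efficient and performs well, in some cases surpassing the frozen LLM on detection and captioning metrics.
Freezing the LLM allows efficient training while yielding captioning and detection results comparable to fine-tuned alternatives.
Among the ablations presented, the Ours QDH configuration achieves the highest detection accuracy.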
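
The detection findings above are reported in mIoU. The review does not spell out how the metric is computed, so the following is a minimal sketch under one common assumption: each sample has a single predicted box and a single ground-truth box in (x1, y1, x2, y2) form, and mIoU is the mean of the per-sample IoU. The function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, targets):
    """Average IoU over paired predictions and ground-truth boxes."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    return sum(iou(p, t) for p, t in zip(preds, targets)) / len(preds)
```

For example, `mean_iou([(0, 0, 2, 2), (0, 0, 1, 1)], [(0, 0, 2, 2), (2, 2, 3, 3)])` averages one perfect match and one miss.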

This review was generated by AI and checked by a human editor.