QUICK REVIEW

[Paper Review] HiLM-D: Enhancing MLLMs with Multi-Scale High-Resolution Details for Autonomous Driving

Xinpeng Ding, Jianhua Han | arXiv (Cornell University) | Sep 11, 2023

Multimodal Machine Learning Applications | Cited by 22

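One-Sentence Summary

HiLM-D enhances multimodal large language models (MLLMs) for autonomous driving by adding multi-scale, high-resolution visual detail and a specialized query detection head, enabling accurate bounding-box prediction and risk-object understanding in driving scenes.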

ABSTRACT

Recent efforts to use natural language for interpretable driving focus mainly on planning and neglect perception tasks. In this paper, we address this gap by introducing ROLISP (Risk Object Localization and Intention and Suggestion Prediction), which aims at interpretable risk object detection and suggestions for ego-car motion. Accurate ROLISP requires extensive reasoning to identify critical traffic objects and infer their intentions, prompting us to explore the capabilities of multimodal large language models (MLLMs). However, the CLIP-ViT vision encoders in existing MLLMs have limited perception performance and struggle to capture essential visual perception information, e.g., high-resolution, multi-scale and vision-related inductive biases, which are important for autonomous driving. To address these challenges, we introduce HiLM-D, a resource-efficient framework that enhances visual information processing in MLLMs for ROLISP. Our method is motivated by the fact that the primary variations in autonomous driving scenarios are the motion trajectories rather than the semantic or appearance information (e.g., the shapes and colors) of objects. Hence, the visual processing of HiLM-D is a two-stream framework: (i) a temporal reasoning stream, which receives low-resolution dynamic video content to capture temporal semantics, and (ii) a spatial perception stream, which receives a single high-resolution frame to capture holistic visual perception-related information. The spatial perception stream is kept lightweight by a well-designed P-Adapter, which is training-efficient and easily integrated into existing MLLMs. Experiments on the DRAMA-ROLISP dataset show that HiLM-D improves significantly over current MLLMs, with gains of 3.7% in BLEU-4 for captioning and 8.7% in mIoU for detection.

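Research Motivation and Goals

Advance high-resolution scene understanding for autonomous driving in multimodal large language models.
Integrate the spatiotemporal features of video perception into MLLMs through ST-Adapters.
Enable object detection and bounding-box reasoning within an LLM-based framework.
Study how different query detection heads and position representations affect detection performance.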

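Proposed Method

Introduce ST-Adapters, built on channel-wise (depthwise) 3D convolution, to fuse video features with LLM representations.
Extend the baseline MLLM (MiniGPT-4 and its variants) with an auxiliary detector that generates bounding boxes from the LLM hidden states.
Compare several query detection head (QDH) architectures, including LLM-based regression, a DETR-style head, and the proposed head.
Experiment with position representations for object localization (numerical coordinates versus an extra coordinate vocabulary).
Ablate freezing the LLM against LoRA-based fine-tuning to assess efficiency and performance.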
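
The frozen-versus-LoRA comparison in the last item can be made concrete with a small sketch. This is not the paper's implementation: it only illustrates the general LoRA idea of keeping a pretrained weight matrix frozen and learning a low-rank update, h = W x + (alpha / r) B A x. The class name `LoRALinear`, the helper `matvec`, and the toy 4x4 weight are assumptions made for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical toy layer: frozen weight W plus a low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weights, never updated
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts small and random, B starts at zero, so the update is zero at first.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)               # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank trainable path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        """Count only the adapter parameters (A and B)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1)
print(layer.forward([1.0, 2.0, 3.0, 4.0]))  # [1.0, 2.0, 3.0, 4.0]: B is zero, so output equals the frozen layer
print(layer.trainable_params())             # 8 adapter parameters versus 16 in the full matrix
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen one, which is why the two settings share a starting point and differ only in how many parameters are updated during training.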

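Experimental Results

Research Questions

RQ1: Do multi-scale, high-resolution visual details improve object localization and risk understanding for MLLMs in autonomous driving?
RQ2: How do different query detection head architectures affect bounding-box accuracy in MLLMs?
RQ3: How do the position representation and the training strategy (LoRA versus frozen) affect detection and captioning performance?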

Key Findings

Type	Captioning AVG	Detection B4	mIoU
Vocab.	54.7	43.2	49.0
Numerical	55.8	48.9	52.4
Ours	55.8	59.6	57.7
LoRA	—	59.6	—
Frozen	55.8	59.6	—
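
For readers unfamiliar with the mIoU column, the sketch below shows one common way to compute it. The review does not spell out the exact protocol, so this assumes one predicted box and one ground-truth box per sample, boxes given as (x1, y1, x2, y2), and mIoU taken as the plain average of per-sample IoU. The function names `box_iou` and `mean_iou` are hypothetical.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predictions, targets):
    """Average IoU over paired predicted and ground-truth boxes (assumed one pair per sample)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    return sum(box_iou(p, t) for p, t in zip(predictions, targets)) / len(predictions)


preds = [(0, 0, 2, 2), (0, 0, 1, 1)]
truth = [(0, 0, 2, 2), (2, 2, 3, 3)]
print(mean_iou(preds, truth))  # 0.5: one perfect match and one complete miss
```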

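Localizing bounding boxes directly with numerical coordinates outperforms using an extra coordinate vocabulary.
The proposed method, which injects LLM-informed priors into cross-attention, is competitive with or better than the DETR-style approach on mIoU and the other detection metrics.
LoRA-based fine-tuning is efficient and performs strongly, at times even surpassing the frozen LLM on detection and captioning metrics.
Freezing the LLM allows efficient training and yields captioning and detection results comparable to the fine-tuned alternatives.
Among the ablations shown, the Ours QDH configuration achieves the highest detection accuracy.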
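
The two position representations compared above can be illustrated with a minimal sketch. It assumes boxes normalized to [0, 1]; the token format `<loc_k>`, the bin count of 100, and all function names are hypothetical and are not taken from the paper. The point is only to show the difference in form: numerical coordinates are written as ordinary text, while a coordinate vocabulary maps each value to one of a fixed set of extra tokens and therefore quantizes it.

```python
import re


def to_numeric_text(box, decimals=3):
    """Write a normalized box as plain numeric text, e.g. "[0.125, 0.500, 0.750, 1.000]"."""
    return "[" + ", ".join(f"{v:.{decimals}f}" for v in box) + "]"


def parse_numeric_text(text):
    """Recover the coordinates from the numeric text form."""
    return [float(m) for m in re.findall(r"\d+(?:\.\d+)?", text)]


def to_vocab_tokens(box, num_bins=100):
    """Map each coordinate to a hypothetical extra-vocabulary token <loc_k>."""
    tokens = []
    for v in box:
        clamped = max(0.0, min(1.0, v))
        bin_index = min(num_bins - 1, int(clamped * num_bins))
        tokens.append(f"<loc_{bin_index}>")
    return tokens


def from_vocab_tokens(tokens, num_bins=100):
    """Decode tokens back to coordinates using the center of each bin."""
    return [(int(t[5:-1]) + 0.5) / num_bins for t in tokens]


box = [0.125, 0.5, 0.75, 1.0]
print(to_numeric_text(box))   # [0.125, 0.500, 0.750, 1.000]
print(to_vocab_tokens(box))   # ['<loc_12>', '<loc_50>', '<loc_75>', '<loc_99>']
```

In this toy setup the vocabulary form can only resolve positions to the width of one bin, whereas the numeric form keeps as many decimals as are written out. Whether that accounts for the gap reported in the table is not something this sketch establishes.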


This review was generated by AI and checked by a human editor.