QUICK REVIEW

[Paper Review] M2DA: Multi-Modal Fusion Transformer Incorporating Driver Attention for Autonomous Driving

Dongyang Xu, Haokun Li | arXiv (Cornell University) | Mar 19, 2024

Autonomous Vehicle Technology and Safety | Cited by 6

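One-Sentence Summary

M2DA introduces LVAFusion, a cross-modal fusion module for camera and LiDAR data, and adds a driver attention mechanism; evaluated in CARLA, it achieves state-of-the-art driving performance while requiring less training data.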

ABSTRACT

End-to-end autonomous driving has witnessed remarkable progress. However, the extensive deployment of autonomous vehicles has yet to be realized, primarily due to 1) inefficient multi-modal environment perception: how to integrate data from multi-modal sensors more efficiently; 2) non-human-like scene understanding: how to effectively locate and predict critical risky agents in traffic scenarios like an experienced driver. To overcome these challenges, in this paper, we propose a Multi-Modal fusion transformer incorporating Driver Attention (M2DA) for autonomous driving. To better fuse multi-modal data and achieve higher alignment between different modalities, a novel Lidar-Vision-Attention-based Fusion (LVAFusion) module is proposed. By incorporating driver attention, we empower the human-like scene understanding ability to autonomous vehicles to identify crucial areas within complex scenarios precisely and ensure safety. We conduct experiments on the CARLA simulator and achieve state-of-the-art performance with less data in closed-loop benchmarks. Source codes are available at https://anonymous.4open.science/r/M2DA-4772.

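Motivation and Objectives

Address inefficient multi-modal environment perception in end-to-end autonomous driving.
Incorporate driver attention to achieve human-like scene understanding.
Develop LVAFusion to improve cross-modal alignment between images and LiDAR.
Use a transformer to predict ego-vehicle waypoints and auxiliary perception states.
Validate performance on the CARLA Town05 Long and Longest6 benchmarks.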

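Proposed Method

Propose LVAFusion, a cross-attention-based fusion module that uses global and local features, positional encodings, and view/sensor embeddings.
Include a driver attention prediction module that generates gaze-based masks to modulate image features.
Fuse LiDAR and multi-view images into a unified token sequence using two cross-attention stages (point cloud first).
Process the fused features with a transformer encoder; the decoder uses waypoint, perception, and traffic-state queries.
Autoregressively predict ego-vehicle waypoints as increments from a GRU, along with an auxiliary perception map and traffic states.
Train end to end by imitation learning on a dataset collected from a rule-based expert, with an L1 loss on waypoints and auxiliary losses on perception and traffic states.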
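
The waypoint head and its loss can be illustrated with a small sketch. It assumes the decoder emits one (dx, dy) offset per step and that these offsets are accumulated into ego-frame waypoints; the GRU itself is replaced here by hard-coded offsets. The names `accumulate_waypoints` and `l1_waypoint_loss` are hypothetical and are not taken from the M2DA code base.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def accumulate_waypoints(deltas: List[Point], start: Point = (0.0, 0.0)) -> List[Point]:
    """Turn per-step offsets into absolute waypoints by a running sum."""
    waypoints = []
    x, y = start
    for dx, dy in deltas:
        x, y = x + dx, y + dy
        waypoints.append((x, y))
    return waypoints


def l1_waypoint_loss(pred: List[Point], target: List[Point]) -> float:
    """Mean absolute error over all waypoint coordinates."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    total = sum(abs(px - tx) + abs(py - ty)
                for (px, py), (tx, ty) in zip(pred, target))
    return total / (2 * len(pred))


# Illustrative offsets standing in for what each GRU step would output.
predicted_deltas = [(0.0, 1.0), (0.5, 1.0), (0.5, 1.0)]
predicted = accumulate_waypoints(predicted_deltas)

# Illustrative expert waypoints (assumed values, not from the paper).
expert = [(0.0, 1.0), (0.0, 2.0), (1.0, 3.0)]
loss = l1_waypoint_loss(predicted, expert)
```

Predicting increments rather than absolute positions keeps each step's output small and makes later waypoints depend on earlier ones, which is what the autoregressive formulation implies.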

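Experimental Results

Research Questions

RQ1: Compared with prior fusion methods, does LVAFusion improve alignment and interaction modeling between the LiDAR and camera modalities?
RQ2: Does incorporating driver attention improve end-to-end driving performance in complex urban and adversarial scenarios?
RQ3: How does M2DA perform against state-of-the-art methods on the CARLA Town05 Long and Longest6 benchmarks?
RQ4: How data-efficient is M2DA when trained on a smaller dataset?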

Key Findings

Method	Fusion	Modality	Extra supervision	Dataset	DS ↑	RC ↑	IS ↑
CILRS	ResNet + Flatten	C1	None	-	7.8±0.3	10.3±0.0	0.75±0.05
LBC	ResNet + Flatten	C3	Expert	157K	12.3±2.0	31.9±2.2	0.66±0.02
Transfuser	Fusion via Transformer	C3L1	Dep+Seg+Map+Box	228K	31.0±3.6	47.5±5.3	0.77±0.04
Roach	ResNet + Flatten	C1	Expert	-	41.6±1.8	96.4±2.1	0.43±0.03
LAV	PointPainting	C4L1	Expert+Seg+Map+Box	189K	46.5±2.3	69.8±2.3	0.73±0.02
TCP	ResNet + Flatten	C1	Expert	189K	57.2±1.5	80.4±1.5	0.73±0.02
MILE	ResNet + Flatten	C1	Map+Box	2.9M	61.1±3.2	97.4±0.8	0.63±0.03
Interfuser	Fusion via Transformer	C3L1	Box	3M	68.3±1.9	95.0±2.9	-
ThinkTwice	Geometric Fusion in BEV	C4L1	Expert+Dep+Seg+Map	2M	70.9±3.4	95.5±2.6	0.75±0.05
DriveAdapter	Geometric Fusion in BEV	C4L1	Expert+Seg+Map	2M	71.9	97.3	0.74
M2DA (ours)	LVAFusion	C3L1	Box	200K	72.6±5.7	89.7±7.8	0.80±0.05

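M2DA achieves state-of-the-art driving performance on Town05 Long, with a DS of 72.6±5.7 and an IS of 0.80±0.05, using 200K training frames.
LVAFusion's cross-attention with prior-informed queries improves multi-modal alignment compared with a random-query baseline.
Incorporating driver attention reduces infractions and raises the overall driving score compared with camera-only baselines and baselines without attention.
M2DA outperforms Transfuser and Roach in driving score on Town05 Long (72.6 vs. 31.0 and 41.6) and also surpasses several models trained on much larger datasets.
Ablation studies show that combining three camera views, driver attention, and LiDAR (3C1A1L) gives the best results.
M2DA demonstrates data efficiency by outperforming several baselines trained on 2–3M frames in driving score while using only 200K frames.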
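
When reading the DS, RC, and IS columns, it helps to recall how the CARLA Leaderboard aggregates them: the driving score is the per-route product of route completion and infraction score, averaged over routes, so it generally differs from the product of the averaged RC and IS columns. The sketch below uses made-up per-route values (not from the paper) to show the gap; `driving_score` is a hypothetical helper name.

```python
from statistics import mean
from typing import List, Tuple


def driving_score(routes: List[Tuple[float, float]]) -> float:
    """Average over routes of (route completion in % * infraction score)."""
    return mean(rc * inf for rc, inf in routes)


# Hypothetical per-route results: (route completion in %, infraction score in [0, 1]).
routes = [(100.0, 1.0), (100.0, 0.5), (40.0, 0.9)]

ds = driving_score(routes)                  # mean of per-route products
avg_rc = mean(rc for rc, _ in routes)       # what an RC column reports
avg_is = mean(inf for _, inf in routes)     # what an IS column reports
product_of_means = avg_rc * avg_is          # not the same quantity as ds
```

This is consistent with the M2DA row, where the reported DS of 72.6 is not simply 89.7 × 0.80 ≈ 71.8.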


This review was generated by AI and reviewed by a human editor.