QUICK REVIEW

[Paper Review] DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection

Yingwei Li, Adams Wei Yu|arXiv (Cornell University)|Mar 15, 2022

Advanced Neural Network Applications | Cited by 24

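One-Sentence Summary

This paper proposes DeepFusion, a method that fuses lidar and camera data at the deep-feature level, using InverseAug and LearnableAlign to achieve state-of-the-art multi-modal 3D object detection, with strong robustness and improved long-range detection on the Waymo Open Dataset.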

ABSTRACT

Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving. While prevalent multi-modal methods simply decorate raw lidar point clouds with camera features and feed them directly to existing 3D detection models, our study shows that fusing camera features with deep lidar features instead of raw points, can lead to better performance. However, as those features are often augmented and aggregated, a key challenge in fusion is how to effectively align the transformed features from two modalities. In this paper, we propose two novel techniques: InverseAug that inverses geometric-related augmentations, e.g., rotation, to enable accurate geometric alignment between lidar points and image pixels, and LearnableAlign that leverages cross-attention to dynamically capture the correlations between image and lidar features during fusion. Based on InverseAug and LearnableAlign, we develop a family of generic multi-modal 3D detection models named DeepFusion, which is more accurate than previous methods. For example, DeepFusion improves PointPillars, CenterPoint, and 3D-MAN baselines on Pedestrian detection for 6.7, 8.9, and 6.2 LEVEL_2 APH, respectively. Notably, our models achieve state-of-the-art performance on Waymo Open Dataset, and show strong model robustness against input corruptions and out-of-distribution data. Code will be publicly available at https://github.com/tensorflow/lingvo/tree/master/lingvo/.
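
The InverseAug idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the augmentations are recorded as 4x4 homogeneous matrices, and the helper names (`rotation_z`, `flip_y`, `inverse_aug`, `project_to_image`), the lidar-to-camera extrinsic, and the intrinsics `K` are hypothetical values chosen only for illustration.

```python
import numpy as np

def rotation_z(theta):
    """4x4 homogeneous rotation about the z (up) axis of the lidar frame."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0, 0],
                     [s,  c, 0, 0],
                     [0,  0, 1, 0],
                     [0,  0, 0, 1]], dtype=np.float64)

def flip_y():
    """4x4 homogeneous flip of the y coordinate."""
    return np.diag([1.0, -1.0, 1.0, 1.0])

def apply_transform(points, T):
    """Apply a 4x4 transform T to an (N, 3) array of points."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    return (homo @ T.T)[:, :3]

def inverse_aug(aug_points, aug_transforms):
    """Undo augmentations that were applied in list order."""
    combined = np.eye(4)
    for A in aug_transforms:
        combined = A @ combined          # later augmentations multiply on the left
    return apply_transform(aug_points, np.linalg.inv(combined))

def project_to_image(lidar_points, T_cam_from_lidar, K):
    """Pinhole projection of lidar-frame points to pixel coordinates."""
    cam = apply_transform(lidar_points, T_cam_from_lidar)
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

# Hypothetical calibration: lidar x-forward/y-left/z-up, camera x-right/y-down/z-forward.
T_cam_from_lidar = np.array([[0, -1,  0, 0],
                             [0,  0, -1, 0],
                             [1,  0,  0, 0],
                             [0,  0,  0, 1]], dtype=np.float64)
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 640.0],
              [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
original = rng.uniform([5, -2, -1], [30, 2, 1], size=(4, 3))  # points ahead of the car

# Training-time augmentation: rotate by 30 degrees, then flip. Parameters are saved.
augs = [rotation_z(np.deg2rad(30.0)), flip_y()]
augmented = original
for A in augs:
    augmented = apply_transform(augmented, A)

# Fusion-time: invert the saved augmentations before looking up image pixels.
recovered = inverse_aug(augmented, augs)
pix_original = project_to_image(original, T_cam_from_lidar, K)
pix_augmented = project_to_image(augmented, T_cam_from_lidar, K)
pix_recovered = project_to_image(recovered, T_cam_from_lidar, K)
```

In this sketch, projecting the augmented points directly lands on the wrong pixels, while projecting the recovered points reproduces the pixel locations of the unaugmented scene, which is the alignment property the abstract attributes to InverseAug.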

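Research Motivation and Objectives

Advance effective multi-modal 3D object detection by fusing deep features from lidar and camera instead of raw point clouds.
Propose alignment-aware fusion techniques that cope with geometry-related data augmentation.
Develop a generic DeepFusion framework that plugs into existing voxel-based 3D detectors and improves their performance.
Demonstrate robustness to input corruptions and out-of-distribution data.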

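Proposed Method

Adopts a deep feature fusion pipeline that fuses lidar features from a pillar/voxel backbone with camera features from a 2D image backbone.
Introduces InverseAug, which reverses geometry-related data augmentations so that the two modalities can be aligned accurately.
Introduces LearnableAlign, a cross-attention module that dynamically weights camera features so that they align with voxel-level lidar features.
Positions DeepFusion as a plug-in that improves existing voxel-based detectors such as PointPillars, CenterPoint, and 3D-MAN while remaining end-to-end trainable.
Shows that camera features at the deep-feature level provide higher-resolution context during end-to-end training while staying aligned.
Evaluates on the Waymo Open Dataset and compares against single-modal baselines and prior multi-modal methods.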
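
The cross-attention step behind LearnableAlign can be sketched in NumPy as follows. This is a hedged illustration rather than the paper's code: it assumes one lidar feature per voxel acts as the query, the camera features of the pixels matched to that voxel act as keys and values, and the attended camera feature is passed through a linear layer and concatenated with the lidar feature. The function name `learnable_align`, the weight names (`Wq`, `Wk`, `Wv`, `Wo`), the `1/sqrt(D)` scaling, the padding mask, and all dimensions are assumptions, and the weights here are random instead of learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def learnable_align(lidar_feat, cam_feats, mask, Wq, Wk, Wv, Wo):
    """
    lidar_feat: (V, Dl)    one lidar feature per voxel (queries)
    cam_feats:  (V, N, Dc) up to N camera features per voxel (keys/values)
    mask:       (V, N)     True where a camera feature is valid (not padding)
    Returns the fused features (V, Dl + Do) and the attention weights (V, N).
    """
    q = lidar_feat @ Wq                                  # (V, D)
    k = cam_feats @ Wk                                   # (V, N, D)
    v = cam_feats @ Wv                                   # (V, N, D)
    scores = np.einsum("vd,vnd->vn", q, k) / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)                # ignore padded pixels
    weights = softmax(scores, axis=-1)
    weights = np.where(mask, weights, 0.0)               # voxels with no pixels get zeros
    attended = np.einsum("vn,vnd->vd", weights, v)       # weighted sum of camera values
    fused = np.concatenate([lidar_feat, attended @ Wo], axis=-1)
    return fused, weights

# Toy sizes (hypothetical): 3 voxels, up to 4 matched pixels each.
V, N, Dl, Dc, D, Do = 3, 4, 8, 6, 5, 7
rng = np.random.default_rng(0)
lidar_feat = rng.normal(size=(V, Dl))
cam_feats = rng.normal(size=(V, N, Dc))
mask = np.array([[True, True, True, True],      # voxel 0: four matched pixels
                 [True, True, False, False],    # voxel 1: two matched pixels
                 [False, False, False, False]]) # voxel 2: outside the camera view
Wq, Wk = rng.normal(size=(Dl, D)), rng.normal(size=(Dc, D))
Wv, Wo = rng.normal(size=(Dc, D)), rng.normal(size=(D, Do))

fused, weights = learnable_align(lidar_feat, cam_feats, mask, Wq, Wk, Wv, Wo)
print(fused.shape)           # (3, 15)
print(weights.sum(axis=-1))  # attention mass per voxel
```

Concatenating instead of replacing keeps the lidar feature intact, so a voxel with no matched pixels still carries its lidar information into the detector.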

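Experimental Results

Research Questions

RQ1: Does aligning and fusing camera and lidar features at the deep-feature level, rather than fusing at the input level, improve multi-modal 3D detection?
RQ2: Can InverseAug and LearnableAlign provide robust, accurate cross-modal alignment under diverse data augmentations and real-world conditions?
RQ3: Is DeepFusion a generic, pluggable fusion method that improves a range of voxel-based 3D detectors?
RQ4: How does multi-modal DeepFusion perform on long-range object detection and under distribution shift or input corruption?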

Key Findings

Method	AP/L1	APH/L1	AP/L2	APH/L2
DeepFusion-Ens (ours) ∗	84.37	83.22	79.54	78.41
InceptioLidar	83.80	82.46	79.15	77.84
AFDetV2-Ens [12]	84.07	82.63	79.04	77.64
Octopus_Noah	83.10	81.67	78.65	77.27
HorizonLiDAR3D [6] ∗	83.28	81.85	78.49	77.11
DeepFusion (ours) ∗	81.89	80.48	76.91	75.54
Cascade3D	81.17	79.63	75.84	74.36
INT	80.29	78.81	75.30	73.89
IUI	80.00	78.60	74.94	73.60
XMU	80.53	78.77	75.14	73.45
Octopus-det	79.25	77.75	74.63	73.20
AFDetV2 [12] ∗	79.77	78.21	74.60	73.12
LENOVO_LR_PCIE_Det	79.46	78.07	74.31	72.97
LENOVO_LR_PCIE_RT_Det	79.42	77.98	74.31	72.97
CenterPoint++ [44] ∗	79.41	77.96	74.22	72.82
SST_v1 [8] ∗	79.99	78.31	74.41	72.81
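
To make the margins in the table concrete, the short sketch below computes the APH/L2 gap between each DeepFusion entry and the entry ranked directly beneath it. The numbers are copied from the rows above; the variable names are illustrative, and only the rows needed for the comparison are included.

```python
# (method, AP/L1, APH/L1, AP/L2, APH/L2), copied from the table above
rows = [
    ("DeepFusion-Ens", 84.37, 83.22, 79.54, 78.41),
    ("InceptioLidar",  83.80, 82.46, 79.15, 77.84),
    ("DeepFusion",     81.89, 80.48, 76.91, 75.54),
    ("Cascade3D",      81.17, 79.63, 75.84, 74.36),
]
aph_l2 = {name: r[-1] for name, *r in rows}

# Margin over the next-ranked entry, rounded to the table's precision.
margin_ens = round(aph_l2["DeepFusion-Ens"] - aph_l2["InceptioLidar"], 2)
margin_single = round(aph_l2["DeepFusion"] - aph_l2["Cascade3D"], 2)
print(margin_ens, margin_single)  # 0.57 1.18
```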

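DeepFusion improves multiple baseline detectors on the Waymo Open Dataset, with especially large gains on long-range objects.
On the Waymo validation set, DeepFusion-Ens achieves leading performance, and DeepFusion surpasses previous multi-modal methods on LEVEL_2 APH.
Ablation experiments show that both InverseAug and LearnableAlign contribute to performance, with InverseAug providing the larger gain.
DeepFusion consistently improves the lidar-only PointPillars, CenterPoint, and 3D-MAN baselines (for example, pedestrian LEVEL_2 APH gains of +6.7, +8.9, and +6.2, respectively).
DeepFusion achieves state-of-the-art results on Waymo and is robust to input corruptions and out-of-distribution data, degrading less under perturbation than single-modal models.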


This review was generated by AI and reviewed by human editors.