QUICK REVIEW

[Paper Review] Deep Continuous Fusion for Multi-Sensor 3D Object Detection

Ming Liang, Bin Yang|arXiv (Cornell University)|Dec 20, 2020

Advanced Neural Network Applications|References: 39|Cited by: 430

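One-Sentence Summary

Proposes a two-stream, end-to-end 3D object detector that continuously fuses camera image features into a LIDAR BEV backbone, using continuous fusion layers to improve multi-sensor 3D localization.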

ABSTRACT

In this paper, we propose a novel 3D object detector that can exploit both LIDAR as well as cameras to perform very accurate localization. Towards this goal, we design an end-to-end learnable architecture that exploits continuous convolutions to fuse image and LIDAR feature maps at different levels of resolution. Our proposed continuous fusion layer encodes both discrete-state image features as well as continuous geometric information. This enables us to design a novel, reliable and efficient end-to-end learnable 3D object detector based on multiple sensors. Our experimental evaluation on both KITTI as well as a large scale 3D object detection benchmark shows significant improvements over the state of the art.

Research Motivation and Objectives

Motivate robust 3D object detection using complementary camera and LIDAR data in autonomous driving.
Develop a learnable fusion mechanism that preserves geometric information across modalities.
Enable end-to-end training with continuous, multi-scale fusion for BEV-based detection.
Demonstrate real-time performance and strong accuracy on KITTI and TOR4D benchmarks.

Proposed Method

Propose a dual-stream network with image and LIDAR BEV branches.
Introduce a continuous fusion layer that projects image features into BEV and fuses them with LIDAR BEV features via a KNN-based interpolation and an MLP that incorporates 3D offsets.
Use deep parametric continuous convolution to interpolate dense BEV features from sparse image-LIDAR correspondences.
Fuse multi-scale image features into BEV across four fusion layers in a feature pyramid-style BEV backbone.
Train end-to-end with a multi-task loss combining classification and regression terms for 3D bounding boxes and orientation.
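
The continuous fusion step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: each target BEV pixel is assumed to come with a 3D location, `pinhole_project` is a hypothetical toy camera model (x is treated as depth), the MLP uses random untrained weights, and out-of-image coordinates are simply clamped. The names `continuous_fusion`, `bilinear_sample`, `make_mlp`, and `pinhole_project` are invented for this example.

```python
import numpy as np

def bilinear_sample(feat, uv):
    """Sample an (H, W, C) feature map at continuous (u, v) pixel coordinates."""
    H, W, _ = feat.shape
    u = np.clip(uv[:, 0], 0, W - 1)  # clamp to the image (assumption)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u1] * du
    bottom = feat[v1, u0] * (1 - du) + feat[v1, u1] * du
    return top * (1 - dv) + bottom * dv

def pinhole_project(points, focal=20.0, cx=32.0, cy=16.0):
    """Hypothetical toy camera: x is depth, y points left, z points up."""
    depth = points[:, 0]
    u = cx - focal * points[:, 1] / depth
    v = cy - focal * points[:, 2] / depth
    return np.stack([u, v], axis=1)

def make_mlp(c_in, c_hidden, c_out, seed=0):
    """Two-layer MLP with random (untrained) weights, for shape illustration only."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=0.1, size=(c_in, c_hidden))
    w2 = rng.normal(scale=0.1, size=(c_hidden, c_out))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

def continuous_fusion(bev_xyz, lidar_xyz, image_feat, project, mlp, k=3):
    """Fuse image features into M target BEV locations via KNN over LIDAR points."""
    M, C = bev_xyz.shape[0], image_feat.shape[2]
    # 1. k nearest LIDAR points per target, measured in the BEV (x, y) plane.
    d2 = ((bev_xyz[:, None, :2] - lidar_xyz[None, :, :2]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :k]                  # (M, k)
    neighbors = lidar_xyz[knn]                           # (M, k, 3)
    # 2. Project the neighbors into the image and sample features there.
    uv = project(neighbors.reshape(-1, 3))
    feats = bilinear_sample(image_feat, uv).reshape(M, k, C)
    # 3. Geometric input: 3D offset from each target to each neighbor.
    offsets = neighbors - bev_xyz[:, None, :]
    # 4. MLP over [image feature, offset], summed over the k neighbors.
    return mlp(np.concatenate([feats, offsets], axis=-1)).sum(axis=1)

rng = np.random.default_rng(0)
lidar_xyz = rng.uniform([5, -3, -1], [20, 3, 1], size=(50, 3))
bev_xyz = np.array([[6, 0, 0], [10, 1, 0], [15, -2, 0], [19, 2, 0]], dtype=float)
image_feat = rng.normal(size=(32, 64, 16))
fused = continuous_fusion(bev_xyz, lidar_xyz, image_feat,
                          pinhole_project, make_mlp(16 + 3, 32, 8), k=3)
```

In a full detector the resulting `fused` array would be combined with the LIDAR BEV feature map at the same resolution; that combination step and the training of the MLP are outside this sketch.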

Experimental Results

Research Questions

RQ1: Can continuous fusion of image and LIDAR features in BEV space improve 3D object detection over LIDAR-only and coarse fusion baselines?
RQ2: How do KNN pooling and geometric offset features affect cross-modal fusion performance?
RQ3: What are the trade-offs between accuracy and real-time inference with multi-scale continuous fusion?

Key Findings

Method	Time (s)	3D AP easy	3D AP moderate	3D AP hard	BEV AP easy	BEV AP moderate	BEV AP hard
MV3D [6]	0.24	66.77	52.73	51.31	85.82	77.00	68.94
VxNet [39]	0.22	77.49	65.11	57.73	89.35	79.26	77.39
NVLidarNet	0.1	n/a	n/a	n/a	84.44	80.04	74.31
PIXOR [37]	0.035	n/a	n/a	n/a	87.25	81.92	76.01
F-PC_CNN [8]	0.5	60.06	48.07	45.22	83.77	75.26	70.17
MV3D [6]	0.36	71.09	62.35	55.12	86.02	76.90	68.49
AVOD-FPN [18]	0.1	81.94	71.88	66.38	88.53	83.79	77.90
F-PointNet [26]	0.17	81.20	70.39	62.19	88.70	84.00	75.33
AVOD [18]	0.08	73.59	65.78	58.38	86.80	85.44	77.73
Our Cont Fuse	0.06	82.54	66.22	64.04	88.81	85.83	77.33
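
To relate the Time (s) column to frame rates, the per-frame runtimes can be inverted. The short sketch below copies the runtimes from the table; the helper name `fps` is chosen for this example only.

```python
# Per-frame runtimes in seconds, copied from the table (MV3D appears in two rows).
runtimes = [
    ("MV3D [6]", 0.24), ("VxNet [39]", 0.22), ("NVLidarNet", 0.1),
    ("PIXOR [37]", 0.035), ("F-PC_CNN [8]", 0.5), ("MV3D [6]", 0.36),
    ("AVOD-FPN [18]", 0.1), ("F-PointNet [26]", 0.17), ("AVOD [18]", 0.08),
    ("Our Cont Fuse", 0.06),
]

def fps(seconds_per_frame):
    """Frames per second is the reciprocal of seconds per frame."""
    return 1.0 / seconds_per_frame

# List the methods from fastest to slowest.
ranked = sorted(runtimes, key=lambda row: row[1])
for name, seconds in ranked:
    print(f"{name:<16} {seconds:>6.3f} s  {fps(seconds):6.1f} FPS")
```

At 0.06 s per frame the proposed detector runs at roughly 16.7 FPS, second only to PIXOR (0.035 s) among the listed methods.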

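On KITTI, achieves the highest BEV AP on the moderate setting (85.83) and the highest 3D AP on the easy setting (82.54) among the listed methods, stays competitive on the remaining settings, and runs in real time at 0.06 s per frame (>15 FPS).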
KITTI results show Our Cont Fuse achieves 3D AP easy 82.54, moderate 66.22, hard 64.04 and BEV AP easy 88.81, moderate 85.83, hard 77.33.
TOR4D results show strong long-range performance with multi-class BEV detection (Vehicle AP0.5 94.94, Vehicle AP0.7 75.34; Pedestrian AP0.3 83.89, AP0.5 74.08; Bicyclist AP0.3 82.32, AP0.5 59.83).
Compared to LIDAR-only and discrete fusion baselines, continuous fusion with KNN pooling and geometric offsets yields consistent gains across metrics.
Ablation studies show both KNN pooling and the geometric offset input are important; removing either degrades performance.
Long-range advantages are pronounced on TOR4D, particularly as the range x increases, indicating effective fusion for distant objects.
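
The subscripts in metrics such as AP0.5 and AP0.7 are conventionally read as intersection-over-union (IoU) thresholds: a detection counts as correct only if its overlap with a ground-truth box reaches the threshold. The sketch below illustrates this with axis-aligned BEV boxes, which is a simplification assumed for brevity (benchmark evaluation typically uses rotated boxes); the function names and example boxes are hypothetical.

```python
def bev_iou_axis_aligned(a, b):
    """IoU of two axis-aligned BEV boxes given as (x_min, y_min, x_max, y_max)."""
    overlap_x = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_y = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_x * overlap_y
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def is_true_positive(pred, gt, iou_threshold):
    """A prediction matches a ground-truth box if the IoU reaches the threshold."""
    return bev_iou_axis_aligned(pred, gt) >= iou_threshold

# A 4 m x 2 m ground-truth vehicle and a prediction shifted 1 m along x.
gt_box = (0.0, 0.0, 4.0, 2.0)
pred_box = (1.0, 0.0, 5.0, 2.0)

iou = bev_iou_axis_aligned(pred_box, gt_box)
hit_at_05 = is_true_positive(pred_box, gt_box, 0.5)
hit_at_07 = is_true_positive(pred_box, gt_box, 0.7)
```

In this example the 1 m shift leaves an IoU of 0.6, so the same prediction is a hit at the 0.5 threshold and a miss at 0.7, which is why AP at the stricter threshold is lower for every class.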

This review was generated by AI and reviewed by human editors.