QUICK REVIEW

[Paper Review] Deep Continuous Fusion for Multi-Sensor 3D Object Detection

Ming Liang, Bin Yang | arXiv (Cornell University) | Dec 20, 2020
Advanced Neural Network Applications | References: 39 | Cited by: 430
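One-Sentence Summary

Proposes a two-stream, end-to-end 3D object detector that repeatedly fuses camera image features into the LIDAR BEV backbone, using continuous fusion layers to improve multi-sensor 3D localization.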

ABSTRACT

In this paper, we propose a novel 3D object detector that can exploit both LIDAR as well as cameras to perform very accurate localization. Towards this goal, we design an end-to-end learnable architecture that exploits continuous convolutions to fuse image and LIDAR feature maps at different levels of resolution. Our proposed continuous fusion layer encodes both discrete-state image features as well as continuous geometric information. This enables us to design a novel, reliable and efficient end-to-end learnable 3D object detector based on multiple sensors. Our experimental evaluation on both KITTI as well as a large scale 3D object detection benchmark shows significant improvements over the state of the art.

Research Motivation and Goals

  • Motivate robust 3D object detection using complementary camera and LIDAR data in autonomous driving.
  • Develop a learnable fusion mechanism that preserves geometric information across modalities.
  • Enable end-to-end training with continuous, multi-scale fusion for BEV-based detection.
  • Demonstrate real-time performance and strong accuracy on KITTI and TOR4D benchmarks.

Proposed Method

  • Propose a dual-stream network with image and LIDAR BEV branches.
  • Introduce a continuous fusion layer that projects image features into BEV and fuses them with LIDAR BEV features via a KNN-based interpolation and an MLP that incorporates 3D offsets.
  • Use deep parametric continuous convolution to interpolate dense BEV features from sparse image-LIDAR correspondences.
  • Fuse multi-scale image features into BEV across four fusion layers in a feature pyramid-style BEV backbone.
  • Train end-to-end with a multi-task loss combining classification and regression terms for 3D bounding boxes and orientation.
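
The continuous fusion step described in the list above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation. It assumes that each LIDAR point already carries an image feature vector (the camera projection and feature sampling are omitted), that the target BEV pixel sits at height z = 0 when computing the 3D offset, and that neighbor outputs are summed before being added to the BEV feature. The helper names (`knn_indices`, `init_mlp`, `continuous_fusion`) and all sizes are hypothetical.

```python
import math
import random

def knn_indices(target_xy, points_xyz, k):
    """Indices of the k LIDAR points nearest to a BEV pixel in the ground plane."""
    dists = [(math.hypot(p[0] - target_xy[0], p[1] - target_xy[1]), i)
             for i, p in enumerate(points_xyz)]
    return [i for _, i in sorted(dists)[:k]]

def linear(vec, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def mlp(vec, params):
    """Two-layer perceptron with a ReLU in between."""
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)

def init_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Random weights for illustration only; a real model would learn them."""
    rng = random.Random(seed)
    def layer(n_out, n_in):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        return weights, [0.0] * n_out
    return layer(hidden_dim, in_dim), layer(out_dim, hidden_dim)

def continuous_fusion(bev_xy, bev_feats, points_xyz, point_img_feats, params, k=3):
    """Add MLP-transformed neighbor image features to each BEV feature."""
    fused = []
    for xy, bev_f in zip(bev_xy, bev_feats):
        out = list(bev_f)
        for j in knn_indices(xy, points_xyz, k):
            px, py, pz = points_xyz[j]
            # Assumption: the target pixel lies at z = 0, so the offset keeps pz.
            offset = [px - xy[0], py - xy[1], pz]
            h = mlp(point_img_feats[j] + offset, params)
            out = [o + v for o, v in zip(out, h)]
        fused.append(out)
    return fused

# Toy data: 4 LIDAR points with 2-D image features, 2 BEV pixels with 4-D features.
points = [(1.0, 0.0, 0.5), (0.0, 2.0, 0.2), (5.0, 5.0, 1.0), (0.5, 0.5, 0.1)]
img_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
bev_xy = [(0.0, 0.0), (4.0, 4.0)]
bev_feats = [[0.0] * 4, [1.0] * 4]
params = init_mlp(in_dim=2 + 3, hidden_dim=8, out_dim=4)
fused = continuous_fusion(bev_xy, bev_feats, points, img_feats, params, k=2)
```

Feeding the offset into the MLP alongside the image feature is what lets the layer weight each neighbor by its geometric relation to the target pixel, rather than averaging neighbors blindly.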

Experimental Results

Research Questions

  • RQ1: Can continuous fusion of image and LIDAR features in BEV space improve 3D object detection over LIDAR-only and coarse fusion baselines?
  • RQ2: How do KNN pooling and geometric offset features affect cross-modal fusion performance?
  • RQ3: What are the trade-offs between accuracy and real-time inference with multi-scale continuous fusion?

Key Findings

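| Method | Time (s) | 3D AP easy | 3D AP moderate | 3D AP hard | BEV AP easy | BEV AP moderate | BEV AP hard |
|---|---|---|---|---|---|---|---|
| MV3D [6] | 0.24 | 66.77 | 52.73 | 51.31 | 85.82 | 77.00 | 68.94 |
| VxNet [39] | 0.22 | 77.49 | 65.11 | 57.73 | 89.35 | 79.26 | 77.39 |
| NVLidarNet | 0.1 | n/a | n/a | n/a | 84.44 | 80.04 | 74.31 |
| PIXOR [37] | 0.035 | n/a | n/a | n/a | 87.25 | 81.92 | 76.01 |
| F-PC_CNN [8] | 0.5 | 60.06 | 48.07 | 45.22 | 83.77 | 75.26 | 70.17 |
| MV3D [6] | 0.36 | 71.09 | 62.35 | 55.12 | 86.02 | 76.90 | 68.49 |
| AVOD-FPN [18] | 0.1 | 81.94 | 71.88 | 66.38 | 88.53 | 83.79 | 77.90 |
| F-PointNet [26] | 0.17 | 81.20 | 70.39 | 62.19 | 88.70 | 84.00 | 75.33 |
| AVOD [18] | 0.08 | 73.59 | 65.78 | 58.38 | 86.80 | 85.44 | 77.73 |
| Our Cont Fuse | 0.06 | 82.54 | 66.22 | 64.04 | 88.81 | 85.83 | 77.33 |
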
  • Achieves the highest BEV AP on the KITTI moderate setting (85.83) among the compared methods and is competitive on 3D detection, while running at 0.06 s per frame (>15 FPS).
  • KITTI results show Our Cont Fuse achieves 3D AP easy 82.54, moderate 66.22, hard 64.04 and BEV AP easy 88.81, moderate 85.83, hard 77.33.
  • TOR4D results show strong long-range performance with multi-class BEV detection (Vehicle AP0.5 94.94, Vehicle AP0.7 75.34; Pedestrian AP0.3 83.89, AP0.5 74.08; Bicyclist AP0.3 82.32, AP0.5 59.83).
  • Compared to LIDAR-only and discrete fusion baselines, continuous fusion with KNN pooling and geometric offsets yields consistent gains across metrics.
  • Ablation studies show both KNN pooling and the geometric offset input are important; removing either degrades performance.
  • Long-range advantages are pronounced on TOR4D, particularly as the range x increases, indicating effective fusion for distant objects.
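
The ">15 FPS" figure can be checked directly from the Time (s) values in the KITTI comparison. The short sketch below converts each reported per-frame time to frames per second. The 15 FPS threshold is taken from the finding above. The dictionary keys are labels chosen here for illustration, with the two MV3D rows distinguished by their reported times.

```python
# Per-frame inference time in seconds, copied from the KITTI comparison rows.
time_per_frame = {
    "MV3D [6] (0.24 s)": 0.24,
    "VxNet [39]": 0.22,
    "NVLidarNet": 0.1,
    "PIXOR [37]": 0.035,
    "F-PC_CNN [8]": 0.5,
    "MV3D [6] (0.36 s)": 0.36,
    "AVOD-FPN [18]": 0.1,
    "F-PointNet [26]": 0.17,
    "AVOD [18]": 0.08,
    "Our Cont Fuse": 0.06,
}

# Frames per second is the reciprocal of the time per frame.
fps = {name: 1.0 / t for name, t in time_per_frame.items()}

# Methods that clear the 15 FPS threshold quoted in the findings.
realtime = sorted(name for name, f in fps.items() if f > 15)

for name in realtime:
    print(f"{name}: {fps[name]:.1f} FPS")
```

Under these numbers only PIXOR and the proposed detector exceed 15 FPS, and of the two only the proposed detector reports 3D AP.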


This review was generated by AI and reviewed by human editors.