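[Paper Review] Deep Continuous Fusion for Multi-Sensor 3D Object Detection
Proposes a two-stream, end-to-end 3D object detector that continuously fuses camera image features into a LIDAR BEV backbone, using continuous fusion layers to improve multi-sensor 3D localization.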
In this paper, we propose a novel 3D object detector that can exploit both LIDAR as well as cameras to perform very accurate localization. Towards this goal, we design an end-to-end learnable architecture that exploits continuous convolutions to fuse image and LIDAR feature maps at different levels of resolution. Our proposed continuous fusion layer encodes both discrete-state image features as well as continuous geometric information. This enables us to design a novel, reliable and efficient end-to-end learnable 3D object detector based on multiple sensors. Our experimental evaluation on both KITTI as well as a large scale 3D object detection benchmark shows significant improvements over the state of the art.
Research Motivation and Goals
- Motivate robust 3D object detection using complementary camera and LIDAR data in autonomous driving.
- Develop a learnable fusion mechanism that preserves geometric information across modalities.
- Enable end-to-end training with continuous, multi-scale fusion for BEV-based detection.
- Demonstrate real-time performance and strong accuracy on KITTI and TOR4D benchmarks.
Proposed Method
- Propose a dual-stream network with image and LIDAR BEV branches.
- Introduce a continuous fusion layer that projects image features into BEV and fuses them with LIDAR BEV features via a KNN-based interpolation and an MLP that incorporates 3D offsets.
- Use deep parametric continuous convolution to interpolate dense BEV features from sparse image-LIDAR correspondences.
- Fuse multi-scale image features into BEV across four fusion layers in a feature pyramid-style BEV backbone.
- Train end-to-end with a multi-task loss combining classification and regression terms for 3D bounding boxes and orientation.
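The continuous fusion step described in the list above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the authors' implementation: the function names (`project`, `sample_feature`, `mlp`, `continuous_fusion`), the toy pinhole camera co-located with the LIDAR, the nearest-pixel feature lookup, the random MLP weights, and the final element-wise sum with the BEV features are all assumptions made for the example.

```python
import math
import random

def project(point, focal=2.0, cx=2.0, cy=2.0):
    """Toy pinhole projection (assumed): x forward, y left, z up."""
    x, y, z = point
    return cx - focal * y / x, cy - focal * z / x

def sample_feature(feature_map, u, v):
    """Nearest-pixel lookup, clamped to the image bounds (a simplification)."""
    rows, cols = len(feature_map), len(feature_map[0])
    r = min(max(int(round(v)), 0), rows - 1)
    c = min(max(int(round(u)), 0), cols - 1)
    return feature_map[r][c]

def mlp(vec, w1, w2):
    """Two-layer perceptron with ReLU, no biases."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def continuous_fusion(bev_cells, bev_features, lidar_points,
                      image_features, w1, w2, k=2):
    """For each BEV cell: gather K nearest LIDAR points, fetch their image
    features, append the 3D offset, run the MLP, and sum over neighbors."""
    fused = []
    for (bx, by), bev_feat in zip(bev_cells, bev_features):
        # K nearest LIDAR points measured on the BEV (x, y) plane.
        neighbors = sorted(
            lidar_points, key=lambda p: math.hypot(p[0] - bx, p[1] - by))[:k]
        acc = [0.0] * len(bev_feat)
        for p in neighbors:
            u, v = project(p)
            img_feat = sample_feature(image_features, u, v)
            # Geometric offset from the BEV cell (taken at z = 0) to the point.
            offset = [p[0] - bx, p[1] - by, p[2]]
            contrib = mlp(list(img_feat) + offset, w1, w2)
            acc = [a + c for a, c in zip(acc, contrib)]
        # Assumed fusion rule: element-wise sum with the LIDAR BEV feature.
        fused.append([a + b for a, b in zip(bev_feat, acc)])
    return fused

random.seed(0)
image_features = [[[random.random(), random.random()] for _ in range(4)]
                  for _ in range(4)]            # 4 x 4 image, 2 channels
lidar_points = [(5.0, 1.0, 0.5), (6.0, -1.0, 0.2), (9.0, 0.5, 1.0)]
bev_cells = [(5.5, 0.0), (8.5, 0.5)]            # BEV cell centers (x, y)
bev_features = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
w1 = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(4)]  # (2+3) -> 4
w2 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # 4 -> 3

fused = continuous_fusion(bev_cells, bev_features, lidar_points,
                          image_features, w1, w2)
print(fused)
```

In a real network the MLP weights would be learned end-to-end and this step would be repeated at each fusion stage of the BEV backbone; the sketch only shows the data flow for one stage.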
Experimental Results
Research Questions
- RQ1: Can continuous fusion of image and LIDAR features in BEV space improve 3D object detection over LIDAR-only and coarse fusion baselines?
- RQ2: How do KNN pooling and geometric offset features affect cross-modal fusion performance?
- RQ3: What are the trade-offs between accuracy and real-time inference with multi-scale continuous fusion?
Key Findings
| Method | Time (s) | 3D AP easy | 3D AP moderate | 3D AP hard | BEV AP easy | BEV AP moderate | BEV AP hard |
|---|---|---|---|---|---|---|---|
| MV3D [6] | 0.24 | 66.77 | 52.73 | 51.31 | 85.82 | 77.00 | 68.94 |
| VxNet [39] | 0.22 | 77.49 | 65.11 | 57.73 | 89.35 | 79.26 | 77.39 |
| NVLidarNet | 0.1 | n/a | n/a | n/a | 84.44 | 80.04 | 74.31 |
| PIXOR [37] | 0.035 | n/a | n/a | n/a | 87.25 | 81.92 | 76.01 |
| F-PC_CNN [8] | 0.5 | 60.06 | 48.07 | 45.22 | 83.77 | 75.26 | 70.17 |
| MV3D [6] | 0.36 | 71.09 | 62.35 | 55.12 | 86.02 | 76.90 | 68.49 |
| AVOD-FPN [18] | 0.1 | 81.94 | 71.88 | 66.38 | 88.53 | 83.79 | 77.90 |
| F-PointNet [26] | 0.17 | 81.20 | 70.39 | 62.19 | 88.70 | 84.00 | 75.33 |
| AVOD [18] | 0.08 | 73.59 | 65.78 | 58.38 | 86.80 | 85.44 | 77.73 |
| Our Cont Fuse | 0.06 | 82.54 | 66.22 | 64.04 | 88.81 | 85.83 | 77.33 |
- On KITTI, achieves the best BEV AP moderate (85.83) and the best 3D AP easy (82.54) among the listed methods, stays competitive on the remaining metrics, and runs in 0.06 s per frame (>15 FPS).
- KITTI results show Our Cont Fuse achieves 3D AP easy 82.54, moderate 66.22, hard 64.04 and BEV AP easy 88.81, moderate 85.83, hard 77.33.
- TOR4D results show strong long-range performance with multi-class BEV detection (Vehicle AP0.5 94.94, Vehicle AP0.7 75.34; Pedestrian AP0.3 83.89, AP0.5 74.08; Bicyclist AP0.3 82.32, AP0.5 59.83).
- Compared to LIDAR-only and discrete fusion baselines, continuous fusion with KNN pooling and geometric offsets yields consistent gains across metrics.
- Ablation studies show both KNN pooling and the geometric offset input are important; removing either degrades performance.
- Long-range advantages are pronounced on TOR4D, particularly as the distance along the x axis increases, indicating effective fusion for distant objects.
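The real-time claim can be checked directly against the Time (s) column of the table above. The snippet below is a small arithmetic sketch: the runtimes are copied from the table, while the 15 FPS threshold is taken from the ">15 FPS" statement in the findings and the variable names are made up for illustration.

```python
# (method, seconds per frame) pairs copied from the KITTI table, in row order.
runtimes = [
    ("MV3D", 0.24), ("VxNet", 0.22), ("NVLidarNet", 0.1), ("PIXOR", 0.035),
    ("F-PC_CNN", 0.5), ("MV3D", 0.36), ("AVOD-FPN", 0.1),
    ("F-PointNet", 0.17), ("AVOD", 0.08), ("Our Cont Fuse", 0.06),
]

# Frames per second is the reciprocal of the per-frame runtime.
fps = [(name, round(1.0 / seconds, 1)) for name, seconds in runtimes]

# Methods that exceed the 15 FPS threshold mentioned in the findings.
real_time = [name for name, rate in fps if rate > 15]

for name, rate in fps:
    print(f"{name:14s} {rate:5.1f} FPS")
print("Above 15 FPS:", real_time)
```

By this measure only PIXOR and the continuous fusion detector clear 15 FPS, and of those two only the latter uses camera input.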
This review was generated by AI and reviewed by a human editor.