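[Paper Explained] DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries
DETR3D is a top-down, NMS-free framework for 3D object detection from multi-view RGB images. It uses a sparse set of 3D object queries whose reference points are projected back into each camera's 2D feature maps, and a transformer fuses the sampled information.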
We introduce a framework for multi-camera 3D object detection. In contrast to existing works, which estimate 3D bounding boxes directly from monocular images or use depth prediction networks to generate input for 3D object detection from 2D information, our method manipulates predictions directly in 3D space. Our architecture extracts 2D features from multiple camera images and then uses a sparse set of 3D object queries to index into these 2D features, linking 3D positions to multi-view images using camera transformation matrices. Finally, our model makes a bounding box prediction per object query, using a set-to-set loss to measure the discrepancy between the ground-truth and the prediction. This top-down approach outperforms its bottom-up counterpart in which object bounding box prediction follows per-pixel depth estimation, since it does not suffer from the compounding error introduced by a depth prediction model. Moreover, our method does not require post-processing such as non-maximum suppression, dramatically improving inference speed. We achieve state-of-the-art performance on the nuScenes autonomous driving benchmark.
Motivation and Objectives
- Motivate 3D object detection from RGB images without dense depth prediction or point cloud reconstruction.
- Propose a top-down, set-based detection head that links 2D features to 3D boxes via backward projection across multiple cameras.
- Eliminate post-processing like non-maximum suppression to improve inference speed.
- Demonstrate state-of-the-art performance on nuScenes and analyze overlap-region and pseudo-LiDAR comparisons.
Proposed Method
- Extract multi-view RGB features with a shared ResNet and FPN.
- Initialize a sparse set of 3D object queries that decode to 3D reference points.
- Project 3D reference points into all camera views using known camera matrices and sample image features via bilinear interpolation.
- Refine object queries through iterative self-attention across layers, incorporating multi-view information.
- Predict 3D bounding boxes and class labels per query using per-layer outputs and train with a set-to-set loss (Hungarian matching).
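The projection and sampling bullets above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the single-channel feature maps, the 3x4 projection matrices, and the choice to average over the views in which the point is visible are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Matrix = List[List[float]]


def project_point(point_3d: Sequence[float], proj: Matrix) -> Optional[Tuple[float, float]]:
    """Project a 3D point with a 3x4 camera matrix; return None if it lies behind the camera."""
    homo = (point_3d[0], point_3d[1], point_3d[2], 1.0)
    u, v, w = (sum(row[i] * homo[i] for i in range(4)) for row in proj)
    if w <= 1e-6:
        return None
    return u / w, v / w


def bilinear_sample(feat: Matrix, u: float, v: float) -> Optional[float]:
    """Bilinearly interpolate a single-channel feature map at pixel coordinates (u, v)."""
    h, w = len(feat), len(feat[0])
    if not (0.0 <= u <= w - 1 and 0.0 <= v <= h - 1):
        return None  # the reference point falls outside this view
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = u - x0, v - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom


def sample_multi_view(point_3d: Sequence[float],
                      cameras: Sequence[Matrix],
                      feature_maps: Sequence[Matrix]) -> Optional[float]:
    """Average the sampled feature over the views in which the point is visible (assumed fusion rule)."""
    samples = []
    for proj, feat in zip(cameras, feature_maps):
        uv = project_point(point_3d, proj)
        if uv is None:
            continue
        value = bilinear_sample(feat, uv[0], uv[1])
        if value is not None:
            samples.append(value)
    return sum(samples) / len(samples) if samples else None


if __name__ == "__main__":
    # Hypothetical toy rig: a front camera, a rear-facing camera, and a shifted camera.
    front = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    rear = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, -1.0, 0.0]]
    shifted = [[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    feats = [[[0.0, 1.0], [2.0, 3.0]],
             [[9.0, 9.0], [9.0, 9.0]],
             [[10.0, 20.0], [30.0, 40.0]]]
    print(sample_multi_view((0.5, 0.5, 1.0), [front, rear, shifted], feats))
```

In this toy setup the reference point is invisible to the rear-facing camera, so only two views contribute to the fused feature. In a full model the sampled value would be a feature vector taken from every FPN level rather than a scalar.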
Experimental Results
Research Questions
- RQ1: Can 3D object detection be effectively achieved directly in 3D space from multi-view RGB images without depth prediction or post-processing?
- RQ2: Does integrating multi-view information at every computation layer improve accuracy, especially in camera overlap regions?
- RQ3: How does an NMS-free, set-based head compare to traditional NMS-based multi-view fusion methods on nuScenes?
- RQ4: What is the impact of iterative refinement and the number of object queries on detection performance?
- RQ5: How does DETR3D compare to pseudo-LiDAR approaches that rely on depth estimation?
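RQ3 rests on the set-based head: during training each ground-truth box is matched to exactly one query, which discourages duplicate predictions and is why NMS is not needed. The sketch below illustrates that one-to-one matching under loud simplifications. It uses brute-force search over assignments rather than the Hungarian algorithm (both find the same optimum on tiny inputs), and the cost is only an L1 distance between box centers, whereas a DETR-style matching cost also includes a classification term. All names are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Center = Sequence[float]


def l1_cost(pred: Center, gt: Center) -> float:
    """L1 distance between a predicted and a ground-truth box center (simplified cost)."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def match_queries(preds: Sequence[Center], gts: Sequence[Center]) -> List[Tuple[int, int]]:
    """Return (query_index, gt_index) pairs that minimise the total cost.

    Brute force over assignments, so only suitable for toy inputs.
    Queries left unmatched would be supervised as "no object".
    """
    if len(gts) > len(preds):
        raise ValueError("need at least as many queries as ground-truth boxes")
    best_cost, best = float("inf"), ()
    for query_ids in permutations(range(len(preds)), len(gts)):
        cost = sum(l1_cost(preds[q], gts[g]) for g, q in enumerate(query_ids))
        if cost < best_cost:
            best_cost, best = cost, query_ids
    return [(q, g) for g, q in enumerate(best)]


if __name__ == "__main__":
    # Four hypothetical query predictions; two of them sit near the same object.
    preds = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (10.5, 0.0, 0.0), (50.0, 50.0, 0.0)]
    gts = [(10.2, 0.0, 0.0), (0.1, 0.0, 0.0)]
    print(match_queries(preds, gts))
```

Queries 1 and 2 both sit close to the first ground-truth box, but only the nearer one is matched. The other would be trained toward the "no object" class, so the head learns not to emit duplicates.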
Key Findings
| Method | NDS ↑ | mAP ↑ | mATE ↓ | mASE ↓ | mAOE ↓ | mAVE ↓ | mAAE ↓ | NMS |
|---|---|---|---|---|---|---|---|---|
| CenterNet | 0.328 | 0.306 | 0.716 | 0.264 | 0.609 | 1.426 | 0.658 | ✓ |
| FCOS3D | 0.373 | 0.299 | 0.785 | 0.268 | 0.557 | 1.396 | 0.154 | ✓ |
| FCOS3D | 0.393 | 0.321 | 0.746 | 0.265 | 0.503 | 1.351 | 0.160 | ✓ |
| FCOS3D S | 0.402 | 0.326 | 0.743 | 0.259 | 0.441 | 1.341 | 0.163 | ✓ |
| FCOS3D P | 0.415 | 0.343 | 0.725 | 0.263 | 0.422 | 1.292 | 0.153 | ✓ |
| DETR3D (Ours) | 0.374 | 0.303 | 0.860 | 0.278 | 0.437 | 0.967 | 0.235 | - |
| DETR3D (Ours) | 0.425 | 0.346 | 0.773 | 0.268 | 0.383 | 0.842 | 0.216 | - |
| DETR3D (Ours) # | 0.434 | 0.349 | 0.716 | 0.268 | 0.379 | 0.842 | 0.200 | - |
- DETR3D achieves state-of-the-art performance on nuScenes without any post-processing like NMS.
- In overlap regions, DETR3D significantly outperforms depth-based fusion methods.
- The model remains robust without explicit depth prediction and benefits from fused multi-view information at each computation layer.
- Iterative refinement across 6 DETR3D layers improves NDS and mAP, with larger numbers of queries continuing to improve performance up to saturation.
- DETR3D achieves substantially higher NDS and mAP than pseudo-LiDAR baselines.
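For readers comparing rows in the table above, the NDS column can be recomputed from the other columns using the nuScenes detection score definition: mAP weighted by 5, plus one minus each of the five true-positive error metrics (each capped at 1), all divided by 10. The function name below is an illustrative choice, and the inputs are taken from the table.

```python
from typing import Sequence


def nuscenes_detection_score(m_ap: float, tp_errors: Sequence[float]) -> float:
    """NDS = (5 * mAP + sum(1 - min(1, err) for the five TP errors)) / 10."""
    if len(tp_errors) != 5:
        raise ValueError("expected mATE, mASE, mAOE, mAVE, mAAE")
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


if __name__ == "__main__":
    # Table rows: (mAP, [mATE, mASE, mAOE, mAVE, mAAE])
    rows = {
        "CenterNet": (0.306, [0.716, 0.264, 0.609, 1.426, 0.658]),
        "DETR3D (second row)": (0.346, [0.773, 0.268, 0.383, 0.842, 0.216]),
    }
    for name, (m_ap, errors) in rows.items():
        print(name, round(nuscenes_detection_score(m_ap, errors), 3))
```

The cap at 1 matters in this table: an mAVE above 1, as in the CenterNet and FCOS3D rows, contributes nothing to NDS, so the lower velocity error of DETR3D accounts for part of its NDS advantage.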
This explainer was generated by AI and reviewed by a human editor.