QUICK REVIEW

[Paper Review] M$^2$BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation

Enze Xie, Zhiding Yu|arXiv (Cornell University)|Apr 11, 2022

Robotics and Sensor-Based Localization | Cited by 86

One-Sentence Summary

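Introduces M$^2$BEV, a unified multi-camera framework that jointly performs 3D object detection and BEV segmentation on a shared BEV representation, achieving state-of-the-art results on nuScenes while remaining efficient.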

ABSTRACT

In this paper, we propose M$^2$BEV, a unified framework that jointly performs 3D object detection and map segmentation in the Birds Eye View (BEV) space with multi-camera image inputs. Unlike the majority of previous works which separately process detection and segmentation, M$^2$BEV infers both tasks with a unified model and improves efficiency. M$^2$BEV efficiently transforms multi-view 2D image features into the 3D BEV feature in ego-car coordinates. Such BEV representation is important as it enables different tasks to share a single encoder. Our framework further contains four important designs that benefit both accuracy and efficiency: (1) An efficient BEV encoder design that reduces the spatial dimension of a voxel feature map. (2) A dynamic box assignment strategy that uses learning-to-match to assign ground-truth 3D boxes with anchors. (3) A BEV centerness re-weighting that reinforces with larger weights for more distant predictions, and (4) Large-scale 2D detection pre-training and auxiliary supervision. We show that these designs significantly benefit the ill-posed camera-based 3D perception tasks where depth information is missing. M$^2$BEV is memory efficient, allowing significantly higher resolution images as input, with faster inference speed. Experiments on nuScenes show that M$^2$BEV achieves state-of-the-art results in both 3D object detection and BEV segmentation, with the best single model achieving 42.5 mAP and 57.0 mIoU in these two tasks, respectively.

Motivation and Objectives

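Advance unified 360-degree perception for autonomous driving by handling 3D detection and BEV segmentation together.
Develop a BEV-based representation that supports multi-view, multi-task learning with a single encoder.
Improve efficiency and accuracy through new components: Spatial-to-Channel BEV encoding, dynamic anchor box assignment, and BEV centerness.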

Proposed Method

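Transform multi-view 2D image features into a 3D voxel representation in ego-car coordinates.
Convert the voxels into BEV features with the proposed Spatial-to-Channel (S2C) operator, which reduces the Z dimension.
Apply a lightweight 3D detection head (from PointPillars) on the BEV features, with a dynamic 3D anchor box assignment strategy.
Add a BEV segmentation head that predicts drivable area and lane boundaries in BEV; use BEV centerness to re-weight distant samples.
Use large-scale 2D detection pre-training (nuImage) and 2D auxiliary supervision to improve the 3D tasks.
Train with a joint loss, $L_{total} = L_{det3d} + L_{seg3d} + L_{det2d}$, where each term is the corresponding task-specific loss.
Optimize with AdamW; fix the input resolution at 1600x900; use no data augmentation; run ablations on backbone choice and encoder design.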
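
The first two steps of the method (lifting 2D image features into an ego-frame voxel grid, then folding the height axis into channels) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names `lift_to_voxels` and `spatial_to_channel`, the toy shapes, the single camera, nearest-neighbour sampling, and identity extrinsics (so the ego Z axis doubles as camera depth, which a real vehicle frame would not do).

```python
import numpy as np

def lift_to_voxels(feat, centers, K, T_cam_from_ego, grid_shape):
    """Fill an (X, Y, Z, C) voxel grid by sampling one camera's feature map.

    feat: (H, W, C) image features; centers: (X*Y*Z, 3) voxel centers in ego
    coordinates; K: (3, 3) intrinsics; T_cam_from_ego: (4, 4) extrinsics.
    All names and shapes are hypothetical.
    """
    h, w, c = feat.shape
    homo = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)
    cam = (T_cam_from_ego @ homo.T).T[:, :3]          # points in the camera frame
    depth = cam[:, 2]
    in_front = depth > 1e-6
    safe_depth = np.where(in_front, depth, 1.0)       # avoid division by zero
    uv = (K @ cam.T).T[:, :2] / safe_depth[:, None]   # pinhole projection
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    voxels = np.zeros((len(centers), c), dtype=feat.dtype)
    voxels[valid] = feat[v[valid], u[valid]]          # nearest-neighbour sampling
    return voxels.reshape(*grid_shape, c)

def spatial_to_channel(voxels):
    """Fold the Z axis of an (X, Y, Z, C) grid into channels: (X, Y, Z*C)."""
    x, y, z, c = voxels.shape
    return voxels.reshape(x, y, z * c)

# Toy setup: a 4x4x2 voxel grid, an 8x8 feature map with 3 channels.
X, Y, Z, C = 4, 4, 2, 3
feat = np.arange(8 * 8 * C, dtype=np.float32).reshape(8, 8, C)
K = np.array([[2.0, 0.0, 4.0], [0.0, 2.0, 4.0], [0.0, 0.0, 1.0]])
T_cam_from_ego = np.eye(4)  # assumption: camera frame == ego frame
gx, gy, gz = np.meshgrid(
    np.linspace(-1, 1, X), np.linspace(-1, 1, Y), np.linspace(1, 2, Z),
    indexing="ij",
)
centers = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)

voxels = lift_to_voxels(feat, centers, K, T_cam_from_ego, (X, Y, Z))
bev = spatial_to_channel(voxels)  # 2D convolutions could now run on this map
```

After the reshape, the BEV map can be processed with ordinary 2D convolutions instead of 3D ones. In a multi-camera setting, one would presumably repeat the lifting step per camera and merge the resulting grids; the merge rule is not specified in this review.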

Experimental Results

Research Questions

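RQ1: Can a unified BEV representation support both 3D object detection and BEV segmentation in a multi-view camera setting?
RQ2: Do an efficiency-oriented BEV encoder design and dynamic anchor box assignment improve camera-based 3D detection and BEV segmentation?
RQ3: How do large-scale 2D pre-training and 2D auxiliary supervision affect 3D perception performance?
RQ4: Does joint multi-task training in a shared BEV framework help 3D detection and BEV segmentation?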

Key Findings

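M$^2$BEV achieves single-model state-of-the-art results on nuScenes for both 3D object detection (42.5 mAP) and BEV segmentation (57.0 mIoU).
Joint training slightly lowers per-task performance, but it provides a shared encoder and efficiency gains across tasks.
Efficient BEV encoding with Spatial-to-Channel (S2C) reduces memory and computation compared with naive 3D convolutions, which permits higher input resolution and faster inference.
Dynamic 3D anchor box assignment improves mAP by up to 7.8 points and NDS by up to 4.8 points over fixed IoU matching.
2D detection pre-training on nuImage substantially improves 3D detection metrics (for example, up to +13.5 mAP) and speeds up convergence; 2D auxiliary supervision improves performance further.
BEV centerness improves BEV segmentation, especially in distant regions. The Spatial-to-Channel BEV encoder makes deeper refinement possible at lower cost.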
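
To make the BEV centerness idea concrete, the sketch below builds a weight map that grows with distance from the ego vehicle and applies it to a per-cell loss. The exact formula is an assumption chosen for illustration (a weight of 1 at the ego center rising to 2 at the farthest corner); this review only states that more distant predictions receive larger weights. The function name `bev_centerness_weights` and the grid extents are hypothetical.

```python
import numpy as np

def bev_centerness_weights(xs, ys, center=(0.0, 0.0)):
    """Return an (X, Y) weight map: 1 at the center, up to 2 at the farthest corner.

    Assumed form: 1 + sqrt(squared distance / largest squared distance on the grid).
    """
    xc, yc = center
    dx2 = (np.asarray(xs, dtype=float) - xc) ** 2
    dy2 = (np.asarray(ys, dtype=float) - yc) ** 2
    dist2 = dx2[:, None] + dy2[None, :]               # broadcast to an (X, Y) grid
    return 1.0 + np.sqrt(dist2 / (dx2.max() + dy2.max()))

# Toy 5x5 BEV grid covering a 100 m x 100 m area around the ego vehicle.
xs = np.linspace(-50.0, 50.0, 5)
ys = np.linspace(-50.0, 50.0, 5)
weights = bev_centerness_weights(xs, ys)

# Re-weight a per-cell loss map (all ones here) so far cells count more.
per_cell_loss = np.ones((5, 5))
weighted_loss = float((weights * per_cell_loss).mean())
```

Because the weights never drop below 1, nearby cells keep their original contribution and only distant cells are emphasized.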


This review was generated by AI and checked by a human editor.