QUICK REVIEW

[Paper Review] BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection

Junjie Huang, Guan Huang|arXiv (Cornell University)|Mar 31, 2022

Advanced Neural Network Applications|Cited by 156

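One-Sentence Summary

BEVDet4D extends BEVDet into the spatial-temporal 4D space by fusing the BEV features of the current and previous frames, improving velocity prediction at almost no extra cost and achieving state-of-the-art vision-based 3D detection on nuScenes.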

ABSTRACT

Single frame data contains finite information which limits the performance of the existing vision-based multi-camera 3D object detection paradigms. For fundamentally pushing the performance boundary in this area, a novel paradigm dubbed BEVDet4D is proposed to lift the scalable BEVDet paradigm from the spatial-only 3D space to the spatial-temporal 4D space. We upgrade the naive BEVDet framework with a few modifications just for fusing the feature from the previous frame with the corresponding one in the current frame. In this way, with negligible additional computing budget, we enable BEVDet4D to access the temporal cues by querying and comparing the two candidate features. Beyond this, we simplify the task of velocity prediction by removing the factors of ego-motion and time in the learning target. As a result, BEVDet4D with robust generalization performance reduces the velocity error by up to -62.9%. This makes the vision-based methods, for the first time, become comparable with those relied on LiDAR or radar in this aspect. On challenge benchmark nuScenes, we report a new record of 54.5% NDS with the high-performance configuration dubbed BEVDet4D-Base, which surpasses the previous leading method BEVDet-Base by +7.3% NDS. The source code is publicly available for further research at https://github.com/HuangJunJie2017/BEVDet .

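Motivation and Objectives

Extend BEVDet from spatial-only 3D detection to spatial-temporal 4D fusion so that it can exploit temporal cues.
Keep the BEVDet architecture intact while adding a lightweight temporal fusion mechanism.
Simplify velocity learning by predicting the positional offset between adjacent BEV features instead of regressing absolute velocity directly.
Demonstrate on nuScenes that velocity, orientation, and attribute errors improve with minimal inference overhead.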
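
The offset-based velocity target can be illustrated with a small NumPy sketch. It assumes that object positions in both frames are already expressed in the current ego frame (that is, ego-motion has been compensated). The function names `offset_target` and `velocity_from_offset`, the sample positions, and the 0.5 s frame interval are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def offset_target(pos_prev, pos_curr):
    """Learning target: object translation between two adjacent frames.

    Both positions are (x, y) in metres in the *current* ego frame, so the
    target contains no ego-motion and does not depend on the frame interval.
    """
    return np.asarray(pos_curr, dtype=float) - np.asarray(pos_prev, dtype=float)

def velocity_from_offset(offset, dt):
    """Recover velocity (m/s) by dividing the offset by the interval dt (s)."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    return np.asarray(offset, dtype=float) / dt

# Hypothetical object that moved 1.0 m in x and 0.5 m in y between frames.
offset = offset_target(pos_prev=(10.0, 2.0), pos_curr=(11.0, 2.5))
velocity = velocity_from_offset(offset, dt=0.5)
print(offset, velocity)
```

Under these assumptions the network only has to regress `offset`; the division by the frame interval happens outside the learned model.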

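Proposed Method

Keep BEVDet's image-view encoder, view transformer, BEV encoder, and task head; add temporal fusion by concatenating the aligned previous-frame BEV feature with the current-frame feature.
Spatially align the previous frame's BEV feature before fusion to remove the effect of ego-motion.
Add an extra BEV encoder before temporal fusion to refine the sparse features and stabilize learning.
Formulate velocity prediction as the translation between adjacent BEV features, which removes ego-motion from the learning target.
Explore alignment by rotation and translation (Eq. (2)), implementing the feature alignment with bilinear interpolation where needed (Eq. (3)).
Evaluate with the nuScenes metrics (mAP, mATE, mASE, mAOE, mAVE, mAAE, NDS) and report inference speed (FPS).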
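
The alignment and concatenation steps listed above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: ego-motion is reduced to a planar rigid transform (`yaw`, `trans`), BEV cells are square, samples that fall outside the previous grid are filled with zeros, and the names `align_prev_bev` and `fuse` are hypothetical.

```python
import numpy as np

def align_prev_bev(prev_feat, yaw, trans, cell_size):
    """Resample the previous BEV feature onto the current ego-centred grid.

    prev_feat : (C, H, W) array on a grid centred on the previous ego pose.
    yaw, trans: assumed planar transform from the current ego frame to the
                previous ego frame, p_prev = R(yaw) @ p_curr + trans.
    cell_size : metres per BEV cell (assumed equal in x and y).
    """
    C, H, W = prev_feat.shape
    # Metric coordinates of every cell centre in the current ego frame.
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x = (cols - (W - 1) / 2.0) * cell_size
    y = (rows - (H - 1) / 2.0) * cell_size
    # The same physical points expressed in the previous ego frame.
    c, s = np.cos(yaw), np.sin(yaw)
    xp = c * x - s * y + trans[0]
    yp = s * x + c * y + trans[1]
    # Fractional indices into the previous grid.
    u = xp / cell_size + (W - 1) / 2.0
    v = yp / cell_size + (H - 1) / 2.0
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = u - u0, v - v0
    out = np.zeros_like(prev_feat, dtype=float)
    # Bilinear interpolation: weighted sum of the four neighbouring cells.
    for ou, ov, w in ((0, 0, (1 - du) * (1 - dv)), (1, 0, du * (1 - dv)),
                      (0, 1, (1 - du) * dv), (1, 1, du * dv)):
        uu, vv = u0 + ou, v0 + ov
        valid = (uu >= 0) & (uu < W) & (vv >= 0) & (vv < H)
        vals = prev_feat[:, np.clip(vv, 0, H - 1), np.clip(uu, 0, W - 1)]
        out += vals * (w * valid)  # out-of-grid neighbours contribute zero
    return out

def fuse(curr_feat, prev_feat, yaw, trans, cell_size):
    """Temporal fusion: concatenate current and aligned previous features."""
    aligned = align_prev_bev(prev_feat, yaw, trans, cell_size)
    return np.concatenate([curr_feat, aligned], axis=0)

# Toy example: the ego vehicle moved one cell (0.5 m) along +x, no rotation.
prev = np.arange(16, dtype=float).reshape(1, 4, 4)
curr = np.ones((1, 4, 4))
fused = fuse(curr, prev, yaw=0.0, trans=(0.5, 0.0), cell_size=0.5)
print(fused.shape)  # (2, 4, 4)
```

In this toy case the aligned previous feature is the original shifted by one column, so a static object occupies the same cell in both halves of the fused tensor, and any remaining displacement between the two halves reflects object motion only.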

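Experimental Results

Research Questions

RQ1: In a vision-only, multi-camera setting, does temporal fusion of BEV features from two adjacent frames improve velocity estimation and overall 3D object detection performance?
RQ2: Which alignment steps and network adjustments are needed to decouple ego-motion from temporal feature differences and to stabilize the learning of velocity prediction?
RQ3: How does BEVDet4D compare with state-of-the-art vision-based baselines on nuScenes in terms of accuracy and speed?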

Key Findings

Method	Modality	mAP	mATE	mASE	mAOE	mAVE	mAAE	NDS	FPS
BEVDet4D-Tiny	Camera	0.338	0.672	0.274	0.519	0.337	0.185	0.476	15.5
BEVDet4D-Base	Camera	0.426	0.579	0.254	0.317	0.301	0.191	0.545	-

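BEVDet4D-Tiny reduces the velocity error by 62.9% relative to BEVDet-Tiny (mAVE drops from 0.909 to 0.337) and improves NDS by 8.4% on the nuScenes validation set.
BEVDet4D-Base reaches 54.5% NDS on the nuScenes validation set and 56.9% NDS on the test set, surpassing previous vision-based methods and the BEVDet variants while keeping latency comparable.
Applying temporal fusion after the extra BEV encoder gives the best trade-off, with clear gains in mAP, NDS, and the velocity metric over the earlier configurations.
By exploiting temporal cues, BEVDet4D narrows the velocity-accuracy gap to LiDAR- and radar-based baselines, reaching an mAVE on the nuScenes validation set that is competitive with non-RGB modalities.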
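
The 62.9% figure for BEVDet4D-Tiny follows directly from the two mAVE values quoted in the findings above. A short check, where the helper name `relative_reduction` is only illustrative:

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

# mAVE of BEVDet-Tiny (0.909) versus BEVDet4D-Tiny (0.337), as quoted above.
reduction = relative_reduction(0.909, 0.337)
print(f"{reduction:.1%}")  # 62.9%
```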

This review was generated by AI and reviewed by a human editor.