QUICK REVIEW

[Paper Digest] STS: Surround-view Temporal Stereo for Multi-view 3D Detection

Zengran Wang, Chen Min|arXiv (Cornell University)|Aug 22, 2022

Advanced Vision and Imaging | Cited by 26

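One-sentence summary

STS introduces surround-view temporal stereo, which exploits cross-camera and temporal geometry for depth estimation and improves multi-view 3D detection accuracy on nuScenes.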

ABSTRACT

Learning accurate depth is essential to multi-view 3D object detection. Recent approaches mainly learn depth from monocular images, which confront inherent difficulties due to the ill-posed nature of monocular depth learning. Instead of using a sole monocular depth method, in this work, we propose a novel Surround-view Temporal Stereo (STS) technique that leverages the geometry correspondence between frames across time to facilitate accurate depth learning. Specifically, we regard the fields of view from all cameras around the ego vehicle as a unified view, namely surround-view, and conduct temporal stereo matching on it. The resulting geometrical correspondence between different frames from STS is utilized and combined with the monocular depth to yield final depth prediction. Comprehensive experiments on nuScenes show that STS greatly boosts 3D detection ability, notably for medium and long distance objects. On BEVDepth with ResNet-50 backbone, STS improves mAP and NDS by 2.6% and 1.4%, respectively. Consistent improvements are observed when using a larger backbone and a larger image resolution, demonstrating its effectiveness.

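Motivation and objectives

Improve depth estimation for multi-view 3D detection beyond what monocular depth alone can offer.
Build a Surround-view Temporal Stereo (STS) framework that exploits geometry across time and across cameras.
Integrate SID depth sampling to better sample both near and far points.
Fuse STS depth with monocular depth for robust dense depth prediction in textureless regions and on moving objects.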
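The SID depth sampling named in the objectives can be illustrated with a minimal sketch. It assumes the commonly used log-uniform form of Spacing-Increasing Discretization, in which bin edges are evenly spaced in log-depth; the paper's exact parameterization may differ. The function name `sid_depth_bins` and the depth range in the example are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def sid_depth_bins(d_min: float, d_max: float, num_bins: int) -> np.ndarray:
    """Return num_bins + 1 bin edges spaced uniformly in log-depth (assumed SID form)."""
    i = np.arange(num_bins + 1)
    return np.exp(np.log(d_min) + np.log(d_max / d_min) * i / num_bins)

# Illustrative range only; not the paper's configuration.
edges = sid_depth_bins(d_min=2.0, d_max=58.0, num_bins=8)
widths = np.diff(edges)  # bin widths grow with depth
print(np.round(edges, 2))
print(np.round(widths, 2))
```

Because the edges are uniform in log-depth, nearby depths get narrow bins and distant depths get wide ones, which matches the stated goal of sampling both near and far points sensibly with a fixed number of hypotheses.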

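Proposed method

Generate depth hypotheses at each reference location and warp features from previous frames across all cameras using differentiable homography.
Build a lightweight cost volume from the group-wise correlation between the aligned source features and the reference features.
Sample depth hypotheses non-uniformly in depth space using Spacing-Increasing Discretization (SID).
Fuse STS depth with monocular depth by element-wise summation followed by a softmax to obtain the final depth distribution.
Retain the monocular depth module to handle textureless regions and moving objects, so that the two depth sources complement each other.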
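Three of the steps above can be sketched in NumPy on toy data: the plane-induced homography for one depth hypothesis, group-wise correlation between two feature maps, and fusion by element-wise sum followed by a softmax over depth. This is a simplified illustration, not the authors' implementation. All function names, the camera intrinsics, and the pose are hypothetical, and the sketch assumes fronto-parallel planes in the reference camera with `X_src = R @ X_ref + t`.

```python
import numpy as np

def plane_homography(K_ref, K_src, R, t, depth):
    """Homography induced by the plane z = depth in the reference camera.

    Assumes X_src = R @ X_ref + t maps reference to source camera coordinates.
    """
    n = np.array([0.0, 0.0, 1.0])  # plane normal in the reference camera
    return K_src @ (R + np.outer(t, n) / depth) @ np.linalg.inv(K_ref)

def groupwise_correlation(ref, src, groups):
    """ref, src: (C, H, W). Returns (groups, H, W): mean product per channel group."""
    c, h, w = ref.shape
    assert c % groups == 0, "channels must divide evenly into groups"
    prod = (ref * src).reshape(groups, c // groups, h, w)
    return prod.mean(axis=1)

def softmax(x, axis=0):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_depth(stereo_scores, mono_scores):
    """Element-wise sum of two (D, H, W) score volumes, then softmax over depth."""
    return softmax(stereo_scores + mono_scores, axis=0)

rng = np.random.default_rng(0)

# Hypothetical intrinsics and relative pose (pure sideways translation).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t, depth = np.eye(3), np.array([0.5, 0.0, 0.0]), 10.0

# Warp one reference pixel with the homography and normalize.
pixel = np.array([100.0, 200.0, 1.0])
warped = plane_homography(K, K, R, t, depth) @ pixel
warped = warped / warped[2]

# Cross-check: back-project to depth, move to the source camera, re-project.
X_ref = depth * (np.linalg.inv(K) @ pixel)
direct = K @ (R @ X_ref + t)
direct = direct / direct[2]

# Group-wise correlation on random stand-in features.
ref_feat = rng.normal(size=(16, 4, 6))
src_feat = rng.normal(size=(16, 4, 6))
corr = groupwise_correlation(ref_feat, src_feat, groups=4)

# Fusion on random stand-in score volumes with 8 depth hypotheses.
stereo = rng.normal(size=(8, 4, 6))
mono = rng.normal(size=(8, 4, 6))
prob = fuse_depth(stereo, mono)
```

In a full pipeline the correlation would be computed once per depth hypothesis on warped feature maps and reduced to one score per hypothesis before fusion; the sketch skips that step and uses random scores. Summing before the softmax means either branch can dominate where it is confident, which is one way to read the complementary role the digest assigns to the monocular module.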

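Experimental results

Research questions

RQ1: Can surround-view temporal stereo (STS) improve depth learning for multi-view 3D detection beyond what monocular depth alone achieves?
RQ2: How do cross-camera temporal correspondences and SID sampling affect depth accuracy and BEV-based detection performance?
RQ3: How does fusing STS depth with monocular depth affect overall detection metrics across distances and scenes?

Main findings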

Method	Backbone	Resolution	mAP ↑	mATE ↓	mASE ↓	mAOE ↓	mAVE ↓	mAAE ↓	NDS ↑
BEVDepth	ResNet-50	256x704	0.351	0.639	0.267	0.479	0.428	0.198	0.475
Ours	ResNet-50	256x704	0.377	0.601	0.275	0.450	0.446	0.212	0.489
BEVDepth	ResNet-50	512x1408	0.405	0.570	0.266	0.383	0.368	0.206	0.523
Ours	ResNet-50	512x1408	0.425	0.532	0.267	0.390	0.369	0.212	0.536
BEVDepth	ConvNeXt	512x1408	0.462	0.540	0.254	0.353	0.379	0.200	0.558
Ours	ConvNeXt	512x1408	0.473	0.515	0.259	0.320	0.366	0.197	0.571

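STS delivers measurable gains over the BEVDepth baseline on nuScenes, improving mAP and NDS in every configuration reported.
With ResNet-50 and 256x704 input, STS raises mAP by 2.6 points (0.351 to 0.377) and NDS by 1.4 points (0.475 to 0.489) over BEVDepth.
At 512x1408 with ResNet-50, STS reaches mAP 0.425 and NDS 0.536, versus 0.405 and 0.523 for BEVDepth.
With a ConvNeXt backbone at 512x1408, STS reaches mAP 0.473 and NDS 0.571, versus 0.462 and 0.558 for BEVDepth.
Ablations show that the surround-view formulation is the key ingredient (a 1.1% mAP gain), while SID further improves STS, especially for medium- and long-range objects.
Fusing STS depth with monocular depth outperforms either module on its own, with notable gains in both mAP and NDS.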
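The gains quoted above can be re-derived from the mAP and NDS columns of the table. The short script below does that arithmetic; the pairing of rows with input resolutions follows the findings text, and the tuple layout and function name `gains` are only illustrative.

```python
# (backbone, resolution, BEVDepth mAP, BEVDepth NDS, STS mAP, STS NDS), from the table.
rows = [
    ("ResNet-50", "256x704", 0.351, 0.475, 0.377, 0.489),
    ("ResNet-50", "512x1408", 0.405, 0.523, 0.425, 0.536),
    ("ConvNeXt", "512x1408", 0.462, 0.558, 0.473, 0.571),
]

def gains(rows):
    """Return absolute mAP and NDS gains of STS over BEVDepth, in percentage points."""
    out = []
    for backbone, res, base_map, base_nds, sts_map, sts_nds in rows:
        d_map = round((sts_map - base_map) * 100, 1)
        d_nds = round((sts_nds - base_nds) * 100, 1)
        out.append((backbone, res, d_map, d_nds))
    return out

results = gains(rows)
for backbone, res, d_map, d_nds in results:
    print(f"{backbone} @ {res}: mAP +{d_map}, NDS +{d_nds}")
```

The differences are absolute, so the "2.6%" and "1.4%" in the abstract are best read as percentage points on the 0 to 1 metric scale rather than relative improvements.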

This digest was generated by AI and reviewed by a human editor.