QUICK REVIEW

[Paper Review] Sparse4D: Multi-view 3D Object Detection with Sparse Spatial-Temporal Fusion

Xuewu Lin, Tianwei Lin|arXiv (Cornell University)|Nov 19, 2022

Advanced Image and Video Retrieval Techniques | Cited by 30

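One-Sentence Summary

Sparse4D introduces sparse multi-view 3D detection built on deformable 4D sampling (point, timestamp, view, scale), hierarchical fusion, and an instance-level depth reweight module, reaching state-of-the-art performance among sparse methods on nuScenes.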

ABSTRACT

Bird-eye-view (BEV) based methods have made great progress recently in multi-view 3D detection task. Comparing with BEV based methods, sparse based methods lag behind in performance, but still have lots of non-negligible merits. To push sparse 3D detection further, in this work, we introduce a novel method, named Sparse4D, which does the iterative refinement of anchor boxes via sparsely sampling and fusing spatial-temporal features. (1) Sparse 4D Sampling: for each 3D anchor, we assign multiple 4D keypoints, which are then projected to multi-view/scale/timestamp image features to sample corresponding features; (2) Hierarchy Feature Fusion: we hierarchically fuse sampled features of different view/scale, different timestamp and different keypoints to generate high-quality instance feature. In this way, Sparse4D can efficiently and effectively achieve 3D detection without relying on dense view transformation nor global attention, and is more friendly to edge devices deployment. Furthermore, we introduce an instance-level depth reweight module to alleviate the ill-posed issue in 3D-to-2D projection. In experiment, our method outperforms all sparse based methods and most BEV based methods on detection task in the nuScenes dataset.

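Motivation and Goals

Improve sparse (non-dense) multi-view 3D detection so that it is competitive with BEV-based methods.
Propose sparse sampling with multiple 4D keypoints across time, view, and scale to obtain richer instance features.
Develop deformable 4D aggregation to fuse multi-dimensional features efficiently.
Introduce an instance-level depth reweight module to alleviate depth ambiguity in image-based 3D perception.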

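Proposed Method

Assign multiple 4D keypoints to each 3D anchor and sample image features across multiple views, scales, and timestamps.
Project the 4D keypoints onto the image feature maps and sample them by bilinear interpolation across scale, view, and time.
Apply hierarchical fusion, with group-level weighting and temporal fusion, to produce refined instance features.
Introduce an instance-level depth reweight module that reweights features using a depth distribution, improving the use of depth information without LiDAR supervision.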
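The sampling, fusion, and reweighting steps listed above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the tensor shapes, the softmax over views, the plain sum over keypoints, and the linear interpolation over depth bins are all assumptions chosen for readability, and a single timestamp and scale are used to keep the example short.

```python
import numpy as np

def project_points(points_3d, projection):
    """Project (N, 3) points with a (3, 4) matrix; return pixel (u, v) and a validity mask."""
    homo = np.concatenate([points_3d, np.ones((len(points_3d), 1))], axis=1)
    cam = homo @ projection.T
    depth = cam[:, 2]
    valid = depth > 1e-5  # points behind the camera cannot be sampled
    uv = cam[:, :2] / np.clip(depth, 1e-5, None)[:, None]
    return uv, valid

def bilinear_sample(feature_map, uv):
    """Sample an (H, W, C) feature map at (N, 2) pixel coords; zero outside the map."""
    h, w, c = feature_map.shape
    out = np.zeros((len(uv), c))
    for i, (u, v) in enumerate(uv):
        x0, y0 = int(np.floor(u)), int(np.floor(v))
        for dx in (0, 1):
            for dy in (0, 1):
                x, y = x0 + dx, y0 + dy
                if 0 <= x < w and 0 <= y < h:
                    weight = (1 - abs(u - x)) * (1 - abs(v - y))
                    out[i] += weight * feature_map[y, x]
    return out

def fuse_views(sampled, scores):
    """sampled: (K, V, C) features, scores: (K, V) raw weights (assumed given).
    Softmax-weighted sum over views, then a sum over keypoints (assumed simplification)."""
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    per_keypoint = (w[..., None] * sampled).sum(axis=1)  # (K, C)
    return per_keypoint.sum(axis=0)  # (C,)

def depth_reweight(instance_feature, depth_logits, depth_bins, anchor_depth):
    """Scale the instance feature by the predicted probability at the anchor's depth.
    The discrete bins and linear interpolation are assumptions of this sketch."""
    probs = np.exp(depth_logits - depth_logits.max())
    probs /= probs.sum()
    confidence = np.interp(anchor_depth, depth_bins, probs)
    return confidence * instance_feature
```

In a full model the fusion scores and depth logits would be predicted from the instance feature, and the same sampling would be repeated for every scale and history frame before fusion.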

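Experimental Results

Research Questions

RQ1: Can sampling sparse 4D keypoints across time, view, and scale narrow the performance gap between sparse detection and BEV-based 3D detection?
RQ2: Does deformable 4D aggregation fuse spatial-temporal context efficiently and effectively enough to refine 3D boxes?
RQ3: Can instance-level depth reweighting improve how camera-based 3D detection uses depth cues without LiDAR supervision?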

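Key Findings

Sparse4D outperforms existing sparse methods on the nuScenes 3D detection benchmark.
Temporal fusion over multiple history frames yields significant gains; with T=4, mAP and NDS improve markedly over T=1.
Depth reweighting and learnable keypoints provide further improvements, with a clear combined gain in mAP and NDS.
Motion compensation (ego and object) significantly improves localization and velocity estimation accuracy.
With multi-stage refinement and history frames, Sparse4D approaches or surpasses several BEV-based methods on key metrics.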
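To make the ego and object motion compensation mentioned in the findings concrete, here is a minimal sketch of how a keypoint in the current frame might be moved into a history frame before sampling. It assumes a constant-velocity object model and a known rigid transform from the current ego frame to the past ego frame; the function name and argument layout are hypothetical and not taken from the paper's code.

```python
import numpy as np

def compensate_keypoints(keypoints, velocity, dt, ego_rotation, ego_translation):
    """Map (N, 3) keypoints from the current frame into a past frame.

    velocity: (3,) object velocity in the current ego frame (assumed constant).
    dt: time elapsed between the past frame and the current frame, in seconds.
    ego_rotation (3, 3), ego_translation (3,): current-to-past ego transform.
    """
    # Object motion: roll the keypoints back along the constant velocity.
    moved = keypoints - velocity * dt
    # Ego motion: express the rolled-back points in the past ego frame.
    return moved @ ego_rotation.T + ego_translation
```

The compensated points can then be projected into the history frame's cameras in the same way as keypoints of the current frame.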

This review was generated by AI and checked by human editors.