QUICK REVIEW

[Paper Digest] Unifying Voxel-based Representation with Transformer for 3D Object Detection

Yanwei Li, Yilun Chen|arXiv (Cornell University)|Jun 1, 2022

Advanced Neural Network Applications | Cited by 127

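One-sentence summary

UVTR unifies multi-modality inputs (LiDAR and camera) in a shared 3D voxel space and uses a transformer decoder for object-level detection and tracking, achieving leading results on nuScenes in both single- and multi-modality settings.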

ABSTRACT

In this work, we present a unified framework for multi-modality 3D object detection, named UVTR. The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection. To this end, the modality-specific space is first designed to represent different inputs in the voxel feature space. Different from previous work, our approach preserves the voxel space without height compression to alleviate semantic ambiguity and enable spatial connections. To make full use of the inputs from different sensors, the cross-modality interaction is then proposed, including knowledge transfer and modality fusion. In this way, geometry-aware expressions in point clouds and context-rich features in images are well utilized for better performance and robustness. The transformer decoder is applied to efficiently sample features from the unified space with learnable positions, which facilitates object-level interactions. In general, UVTR presents an early attempt to represent different modalities in a unified framework. It surpasses previous work in single- or multi-modality entries. The proposed method achieves leading performance in the nuScenes test set for both object detection and the following object tracking task. Code is made publicly available at https://github.com/dvlab-research/UVTR.

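Motivation and goals

Promote a unified voxel-based representation to bridge the modality gap between LiDAR and camera data.
Preserve the 3D voxel space without height compression to reduce semantic ambiguity.
Enable cross-modality knowledge transfer and feature fusion within the unified space.
Use a transformer decoder for efficient object-level interaction and prediction.
Demonstrate clear performance gains on nuScenes for single- and multi-modality 3D detection and tracking.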

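Proposed method

Represent images in the voxel space by sampling image features according to the predicted depth distribution and geometric constraints, forming V_I.
Represent point clouds in the voxel space with a multi-scale voxel backbone, forming V_P.
Apply a voxel encoder to enable spatial interaction within each modality-specific voxel space.
Enable cross-modality interaction through knowledge transfer (teacher-student) and feature fusion in the unified voxel space V_U.
Use a deformable transformer decoder to sample features for object queries at learnable 3D reference points, followed by iterative box refinement.
Optimize detection with a Hungarian set-to-set loss; an L_KT loss can optionally be added for cross-modality knowledge transfer.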
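
As a rough illustration of the image-to-voxel step in the list above, the sketch below weights each sampled image feature by the predicted depth probability at the voxel's depth bin. It is a minimal pure-Python sketch, not the UVTR implementation: the function name `lift_image_to_voxels`, the data layout, and the precomputed `voxel_to_pixel` lookup (standing in for the camera geometry) are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Voxel = Tuple[int, int, int]  # (x, y, z) index in the voxel grid
Pixel = Tuple[int, int, int]  # (u, v, d): image column, image row, depth bin


def lift_image_to_voxels(
    features: List[List[List[float]]],     # features[v][u] -> C channel values
    depth_probs: List[List[List[float]]],  # depth_probs[v][u] -> D bin probabilities
    voxel_to_pixel: Dict[Voxel, Optional[Pixel]],  # hypothetical geometry lookup
) -> Dict[Voxel, List[float]]:
    """Build an image voxel space by depth-weighted feature sampling."""
    channels = len(features[0][0])
    voxels: Dict[Voxel, List[float]] = {}
    for voxel, hit in voxel_to_pixel.items():
        if hit is None:
            # Voxel falls outside the camera view: leave it empty.
            voxels[voxel] = [0.0] * channels
            continue
        u, v, d = hit
        # Probability that the surface seen at pixel (u, v) lies in depth bin d.
        weight = depth_probs[v][u][d]
        voxels[voxel] = [weight * value for value in features[v][u]]
    return voxels


# Toy input: a 1x1 image with 2 feature channels and 2 depth bins.
features = [[[1.0, 2.0]]]
depth_probs = [[[0.25, 0.75]]]
lookup = {
    (0, 0, 0): (0, 0, 0),  # near voxel maps to depth bin 0
    (0, 0, 1): (0, 0, 1),  # voxel one step up in z maps to depth bin 1
    (1, 0, 0): None,       # voxel not visible from the camera
}
image_voxels = lift_image_to_voxels(features, depth_probs, lookup)
```

The output keeps a separate entry for every (x, y, z) cell instead of summing over z, which reflects the abstract's point about preserving the voxel space without height compression.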

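Experimental results

Research questions

RQ1: Can a unified voxel-based representation effectively fuse LiDAR and camera data for 3D object detection?
RQ2: Does preserving the full 3D voxel space (without height compression) improve 3D reasoning and reduce semantic ambiguity?
RQ3: How do cross-modality knowledge transfer and modality fusion affect detection robustness and accuracy with single- and multi-modality inputs?
RQ4: What gains do multi-frame inputs bring to detection and tracking in a unified voxel space?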

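Key findings

On the nuScenes validation/test sets, LiDAR-based UVTR reaches 69.7% NDS and 63.9% mAP; with multi-modality input it reaches 71.1% NDS on the nuScenes test set.
Camera-based UVTR-C with multi-camera sweeps reaches 55.1% NDS on the nuScenes test set, and UVTR-M (multi-modality) reaches 71.1% NDS and 67.1% mAP on the nuScenes test set.
Knowledge transfer and modality fusion give consistent gains across settings, including up to 2.6% NDS and 1.8% mAP when multi-modality guidance is used.
Multi-frame input improves performance markedly: adding sweeps raises LiDAR NDS by up to 18.1% and camera NDS by more than 5%.
UVTR also performs strongly in tracking with a simple greedy tracker; for example, UVTR-M reaches 70.1 AMOTA in the LiDAR+Camera setting on nuScenes.
The method remains robust to dropped camera views and sensor calibration noise, especially in the multi-modality setting.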
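
The NDS figures above combine mAP with five true-positive error metrics. As a reminder of how the nuScenes benchmark defines the score, the sketch below computes NDS = (5 * mAP + sum(1 - min(1, err))) / 10 over mATE, mASE, mAOE, mAVE, and mAAE. The function name and the input values in the example are hypothetical and are not results from the paper.

```python
from typing import Dict

TP_METRICS = ("mATE", "mASE", "mAOE", "mAVE", "mAAE")


def nuscenes_detection_score(mean_ap: float, tp_errors: Dict[str, float]) -> float:
    """Compute NDS from mAP (in [0, 1]) and the five mean TP errors."""
    if set(tp_errors) != set(TP_METRICS):
        raise ValueError(f"expected exactly these keys: {TP_METRICS}")
    # Each error is clipped at 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, tp_errors[name]) for name in TP_METRICS]
    # mAP carries weight 5; each of the five TP scores carries weight 1.
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
example_errors = {name: 0.5 for name in TP_METRICS}
example_nds = nuscenes_detection_score(0.5, example_errors)
```

Because mAP has half of the total weight, a gain in NDS that is larger than the gain in mAP indicates that the true-positive errors (translation, scale, orientation, velocity, attribute) also improved.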

This digest was generated by AI and reviewed by a human editor.