QUICK REVIEW

[Paper Review] Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

Linyan Huang, Zhiqi Li|arXiv (Cornell University)|Oct 24, 2023

Advanced Neural Network Applications|Cited by 8

One-Sentence Summary

Introduces a vision-centric multi-modal expert (VCD-E) and a camera-only apprentice (VCD-A) with trajectory-based distillation and occupancy reconstruction to close the gap between camera-only and multi-modal 3D detectors, achieving state-of-the-art results on nuScenes.

ABSTRACT

Current research is primarily dedicated to advancing the accuracy of camera-only 3D object detectors (apprentice) through the knowledge transferred from LiDAR- or multi-modal-based counterparts (expert). However, the presence of the domain gap between LiDAR and camera features, coupled with the inherent incompatibility in temporal fusion, significantly hinders the effectiveness of distillation-based enhancements for apprentices. Motivated by the success of uni-modal distillation, an apprentice-friendly expert model would predominantly rely on camera features, while still achieving comparable performance to multi-modal models. To this end, we introduce VCD, a framework to improve the camera-only apprentice model, including an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision. The multi-modal expert VCD-E adopts an identical structure as that of the camera-only apprentice in order to alleviate the feature disparity, and leverages LiDAR input as a depth prior to reconstruct the 3D scene, achieving the performance on par with other heterogeneous multi-modal experts. Additionally, a fine-grained trajectory-based distillation module is introduced with the purpose of individually rectifying the motion misalignment for each object in the scene. With those improvements, our camera-only apprentice VCD-A sets new state-of-the-art on nuScenes with a score of 63.1% NDS.

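Motivation and Goals

Improve camera-based 3D object detectors by narrowing the domain gap between them and multi-modal experts.
Propose a vision-centric expert that uses LiDAR depth as a depth prior while sharing the same architecture as the camera-only model.
Develop trajectory-based distillation to address motion misalignment in long-term temporal fusion.
Introduce occupancy reconstruction to provide dense depth supervision and improve depth estimation.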

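Proposed Method

Build a vision-centric expert (VCD-E) that fuses image features with LiDAR depth to construct a BEV representation, while sharing the same architecture as the apprentice model.
Freeze the expert and distill its intermediate features into the camera-only apprentice (VCD-A) through auxiliary losses.
Trajectory-based distillation: warp historical object trajectories to the current frame, sample the aligned BEV features, and compute a trajectory loss (L_TD) to correct motion misalignment.
Occupancy reconstruction: back-project depth into 3D space, build an occupancy grid, and apply L1-based depth/occupancy supervision from the expert to the apprentice (L_OR).
Joint training loss: L_Total = L_A + λ1 L_TD + λ2 L_OR, where L_A is the apprentice's perception loss.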
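
The following is a minimal NumPy sketch of how the three loss terms could be combined. It is an illustration under stated assumptions, not the authors' implementation: the function names, the nearest-neighbour sampling, the use of mean squared error for the trajectory term, and the default weights are all hypothetical choices made here. Only the L1 form of the occupancy term and the weighted sum itself come from the summary above.

```python
import numpy as np

def sample_bev(bev, points):
    """Nearest-neighbour sample of a (C, H, W) BEV map at (N, 2) (row, col) points.

    Hypothetical helper: a real pipeline would more likely use bilinear sampling.
    """
    h, w = bev.shape[1:]
    rows = np.clip(np.rint(points[:, 0]).astype(int), 0, h - 1)
    cols = np.clip(np.rint(points[:, 1]).astype(int), 0, w - 1)
    return bev[:, rows, cols].T  # shape (N, C)

def trajectory_distill_loss(expert_bev, apprentice_bev, traj_points):
    """L_TD sketch: compare expert and apprentice features along object trajectories.

    traj_points are assumed to be already warped into the current frame.
    Mean squared error is an assumption; the summary does not name the distance.
    """
    expert_feat = sample_bev(expert_bev, traj_points)
    apprentice_feat = sample_bev(apprentice_bev, traj_points)
    return float(np.mean((expert_feat - apprentice_feat) ** 2))

def occupancy_l1_loss(expert_occ, apprentice_occ):
    """L_OR sketch: L1 distance between expert and apprentice occupancy grids."""
    return float(np.mean(np.abs(expert_occ - apprentice_occ)))

def total_loss(l_a, l_td, l_or, lambda1=1.0, lambda2=1.0):
    """L_Total = L_A + lambda1 * L_TD + lambda2 * L_OR (weights are placeholders)."""
    return l_a + lambda1 * l_td + lambda2 * l_or
```

In this sketch the expert maps would be computed with the frozen expert, so only the apprentice receives gradients in an actual training framework.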

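Experimental Results

Research Questions

RQ1: Can a vision-centric expert with a LiDAR depth prior match state-of-the-art multi-modal methods while remaining homogeneous with the vision-only model?
RQ2: Does trajectory-based distillation improve the handling of dynamic objects for camera-only detectors under long-term temporal fusion?
RQ3: Does occupancy-based depth supervision improve depth estimation for foreground objects in BEV space?
RQ4: How does knowledge distillation from the vision-centric expert affect performance across different backbones and temporal lengths?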

Key Findings

Methods	Backbone	Image Size	Frames	mAP ↑	NDS ↑	mATE ↓	mASE ↓	mAOE ↓	mAVE ↓	mAAE ↓
BEVDet	ResNet-50	256 × 704	1	0.298	0.379	0.725	0.279	0.589	0.860	0.245
PETR	ResNet-50	384 × 1056	1	0.313	0.381	0.768	0.278	0.564	0.923	0.225
BEVDet4D	ResNet-50	256 × 704	2	0.322	0.457	0.703	0.278	0.495	0.354	0.206
BEVDepth	ResNet-50	256 × 704	2	0.351	0.475	0.639	0.267	0.479	0.428	0.198
BEVStereo	ResNet-50	256 × 704	2	0.372	0.500	0.598	0.270	0.438	0.367	0.190
STS	ResNet-50	256 × 704	2	0.377	0.489	0.601	0.275	0.450	0.446	0.212
VideoBEV	ResNet-50	256 × 704	8	0.422	0.535	0.564	0.276	0.440	0.286	0.198
SOLOFusion	ResNet-50	256 × 704	16+1	0.427	0.534	0.567	0.274	0.411	0.252	0.188
StreamPETR	ResNet-50	256 × 704	8	0.432	0.540	0.581	0.272	0.413	0.295	0.195
Baseline	ResNet-50	256 × 704	8+1	0.401	0.515	0.595	0.279	0.489	0.291	0.198
VCD-A	ResNet-50	256 × 704	8+1	0.426	0.540	0.547	0.271	0.433	0.268	0.207
Baseline ∗	ResNet-50	256 × 704	8+1	0.418	0.542	0.522	0.267	0.428	0.262	0.188
VCD-A ∗	ResNet-50	256 × 704	8+1	0.446	0.566	0.497	0.260	0.350	0.257	0.203
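
To make the relationship between the table columns explicit, the sketch below recomputes NDS from mAP and the five true-positive error metrics using the standard nuScenes detection score definition. The function name and argument names are chosen here for illustration only.

```python
def nds(m_ap, m_ate, m_ase, m_aoe, m_ave, m_aae):
    """nuScenes detection score: NDS = (5 * mAP + sum(1 - min(1, mTP))) / 10."""
    tp_errors = [m_ate, m_ase, m_aoe, m_ave, m_aae]
    # Each error is capped at 1 and converted into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0

# BEVDet row from the table above
print(round(nds(0.298, 0.725, 0.279, 0.589, 0.860, 0.245), 3))  # 0.379
# Last VCD-A row from the table above
print(round(nds(0.446, 0.497, 0.260, 0.350, 0.257, 0.203), 3))  # 0.566
```

Because mAP carries half of the total weight, a gain in mAP moves NDS more than an equal reduction in any single error metric.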

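Using only an image backbone and a depth prior, VCD-E performs on par with state-of-the-art multi-modal methods, reaching 67.7% mAP and 71.1% NDS on the nuScenes validation set.
VCD-A surpasses the previous camera-only state of the art on the nuScenes validation set (NDS 0.566, mAP 0.446, with test-time augmentation) and leads on the test set (NDS 0.631, mAP 0.548, with a ConvNext-B backbone).
Trajectory-based distillation yields larger gains as the temporal window grows, improving NDS by up to 5.7 points and mAP by up to 5.7 points.
Occupancy reconstruction provides dense 3D supervision that improves depth prediction and object localization, and it contributes substantially to the overall gain.
Combining long-term temporal fusion, trajectory-based distillation, and occupancy supervision gives camera-only detectors state-of-the-art results on nuScenes.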

This review was generated by AI and checked by a human editor.