QUICK REVIEW

[Paper Review] MonoDistill: Learning Spatial Features for Monocular 3D Object Detection

Zhiyu Chong, Xinzhu Ma|arXiv (Cornell University)|Jan 26, 2022

Advanced Neural Network Applications|Cited by 59

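One-Sentence Summary

MonoDistill transfers spatial cues from a LiDAR-based teacher to a monocular detector by projecting LiDAR signals onto the image plane, improving monocular 3D detection without adding any inference cost.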

ABSTRACT

3D object detection is a fundamental and challenging task for 3D scene understanding, and the monocular-based methods can serve as an economical alternative to the stereo-based or LiDAR-based methods. However, accurately detecting objects in the 3D space from a single image is extremely difficult due to the lack of spatial cues. To mitigate this issue, we propose a simple and effective scheme to introduce the spatial information from LiDAR signals to the monocular 3D detectors, without introducing any extra cost in the inference phase. In particular, we first project the LiDAR signals into the image plane and align them with the RGB images. After that, we use the resulting data to train a 3D detector (LiDAR Net) with the same architecture as the baseline model. Finally, this LiDAR Net can serve as the teacher to transfer the learned knowledge to the baseline model. Experimental results show that the proposed method can significantly boost the performance of the baseline model and ranks the 1st place among all monocular-based methods on the KITTI benchmark. Besides, extensive ablation studies are conducted, which further prove the effectiveness of each part of our designs and illustrate what the baseline model has learned from the LiDAR Net. Our code will be released at https://github.com/monster-ghost/MonoDistill.

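Research Motivation and Objectives

Improve monocular 3D object detection with spatial cues from LiDAR, without adding inference cost.
Propose a distillation framework that aligns LiDAR-derived maps with the RGB input so that knowledge transfers effectively.
Train a LiDAR-based teacher network on image-like LiDAR maps and distill its guidance into the monocular student network.
Demonstrate that the three distillation schemes and the attention-based fusion improve detection performance.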

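Proposed Method

Generate image-like LiDAR maps by projecting LiDAR points onto the image plane and interpolating them into dense depth.
Train a LiDAR-based teacher network with the same architecture as the student baseline (MonoDLE).
Apply three distillation schemes to transfer spatial cues from teacher to student: scene-level distillation of feature affinity, object-level distillation in feature space, and object-level distillation in result space.
Strengthen the feature-space distillation with an attention-based fusion module.
Train the student end-to-end with the loss L = L_src + lambda1*L_sf + lambda2*L_of + lambda3*L_or, where L_sf, L_of, and L_or correspond to the three distillation schemes above; the teacher is trained with L_src only.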
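
The first and last steps above can be illustrated with a minimal NumPy sketch. It is not the authors' code. It assumes the LiDAR points are already in the camera frame and that `P` is a KITTI-style 3x4 projection matrix; the function names and the default lambda weights are hypothetical placeholders, and the interpolation that densifies the sparse map is left out.

```python
import numpy as np


def project_lidar_to_depth(points_cam, P, height, width):
    """Project (N, 3) camera-frame points into a sparse (H, W) depth map.

    P is an assumed 3x4 projection matrix. Pixels without a LiDAR
    return stay at 0.
    """
    # Homogeneous coordinates, shape (N, 4).
    homo = np.hstack([points_cam, np.ones((points_cam.shape[0], 1))])
    uvw = homo @ P.T  # shape (N, 3): (u*w, v*w, w)

    # Drop points behind the camera before dividing by w.
    uvw = uvw[uvw[:, 2] > 0]
    depth = uvw[:, 2]
    u = np.round(uvw[:, 0] / depth).astype(int)
    v = np.round(uvw[:, 1] / depth).astype(int)

    # Keep only points that land inside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, depth = u[inside], v[inside], depth[inside]

    # Write far points first so the nearest point wins at shared pixels
    # (with repeated indices NumPy keeps the last assigned value).
    order = np.argsort(-depth)
    depth_map = np.zeros((height, width), dtype=np.float32)
    depth_map[v[order], u[order]] = depth[order]
    return depth_map


def total_loss(l_src, l_sf, l_of, l_or, lam1=1.0, lam2=1.0, lam3=1.0):
    """L = L_src + lambda1*L_sf + lambda2*L_of + lambda3*L_or.

    The default weights are placeholders, not values from the paper.
    """
    return l_src + lam1 * l_sf + lam2 * l_of + lam3 * l_or
```

Because the teacher consumes this map in the same image layout as the RGB input, it can share the student's architecture, which is what allows features and predictions to be compared position by position during distillation.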

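Experimental Results

Research Questions

RQ1: Can the spatial cues learned by a LiDAR-based teacher improve monocular 3D detection without changing the student architecture or adding inference cost?
RQ2: Which distillation streams (scene-level, object-level in feature space, object-level in result space) transfer spatial information most effectively?
RQ3: Does aligning LiDAR-derived maps with RGB data via projection provide a better supervision signal than using depth estimation as an intermediate task?
RQ4: How does the method perform on KITTI compared with state-of-the-art monocular detectors?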

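Key Findings

The full MonoDistill method improves consistently over the baseline on 3D and BEV metrics, on both the KITTI validation and test sets.
On the KITTI validation set, 3D AP at IoU 0.7 improves by 3.34 (moderate), 5.02 (easy), and 2.98 (hard); BEV AP improves by 5.16 (moderate), 6.62 (easy), and 3.87 (hard).
On the KITTI test set, the method shows significant gains over prior monocular methods on 3D and BEV metrics, with a runtime of about 40 ms per image, faster than several depth-based methods.
Ablation studies show that all three distillation schemes contribute; guidance restricted to foreground regions and region-based labels works better than full-image or sparse-pixel guidance.
Cross-model analysis indicates that the gains come from the teacher's complementary spatial information rather than from its higher accuracy, and that using depth estimation as an intermediate task loses information compared with guiding the detector directly from LiDAR.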
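
The ablation finding on foreground guidance can be made concrete with a small sketch of a feature-distillation loss restricted to object regions. This is an illustration under assumptions rather than the paper's exact formulation: the L1 distance, the box-derived mask, and the helper names `boxes_to_mask` and `masked_feature_l1` are all hypothetical choices.

```python
import numpy as np


def boxes_to_mask(boxes, height, width):
    """Binary (H, W) mask that is 1 inside any 2D box.

    Boxes are (x1, y1, x2, y2) in feature-map pixels, end-exclusive.
    """
    mask = np.zeros((height, width), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1.0
    return mask


def masked_feature_l1(student_feat, teacher_feat, mask):
    """Mean absolute difference of (C, H, W) features over masked pixels only."""
    diff = np.abs(student_feat - teacher_feat) * mask[None, :, :]
    # Normalise by the number of feature entries the mask selects.
    count = mask.sum() * student_feat.shape[0]
    return float(diff.sum() / max(count, 1.0))
```

Passing an all-ones mask gives the full-image variant, so the same function can be used to compare the two kinds of guidance. With the foreground mask, feature differences in background pixels do not enter the loss at all.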

This review was generated by AI and checked by a human editor.