QUICK REVIEW

[Paper Review] RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving

Peixuan Li, Huaici Zhao|arXiv (Cornell University)|Jan 10, 2020

Advanced Neural Network Applications|References: 46|Cited by: 51

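One-Sentence Summary

RTM3D is a single-stage monocular 3D object detector that recovers 3D pose, dimensions, and location from nine keypoints obtained by projecting the 3D bounding box into the image, using geometric reprojection constraints; it runs in real time on KITTI and requires no additional supervision data.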

ABSTRACT

In this work, we propose an efficient and accurate monocular 3D detection framework in single shot. Most successful 3D detectors take the projection constraint from the 3D bounding box to the 2D box as an important component. Four edges of a 2D box provide only four constraints and the performance deteriorates dramatically with the small error of the 2D detector. Different from these approaches, our method predicts the nine perspective keypoints of a 3D bounding box in image space, and then utilizes the geometric relationship of 3D and 2D perspectives to recover the dimension, location, and orientation in 3D space. In this method, the properties of the object can be predicted stably even when the estimation of keypoints is very noisy, which enables us to obtain fast detection speed with a small architecture. Training our method only uses the 3D properties of the object without the need for external networks or supervision data. Our method is the first real-time system for monocular image 3D detection while achieving state-of-the-art performance on the KITTI benchmark. Code will be released at https://github.com/Banconxuan/RTM3D.

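Motivation and Objectives

Advance real-time monocular 3D detection for autonomous driving by relying on image cues rather than LiDAR or large amounts of external data.
Formulate 3D bounding box estimation as keypoint detection followed by energy minimization under perspective projection.
Develop a fast one-stage network dedicated to 3D keypoint detection, with no reliance on additional networks or annotations.
Improve robustness to noisy keypoints and small 2D localization errors through a geometric optimization step.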

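Proposed Method

Predict the nine perspective keypoints of the 3D bounding box (eight vertices plus the center) in the image with a single-stage CNN.
Use a novel Keypoint Feature Pyramid Network (KFPN) to produce scale-invariant multi-scale keypoint responses, rather than relying on the 2D FPN used for multi-scale boxes.
Formulate 3D box estimation as a nonlinear least-squares optimization on SE(3) that combines the camera-point reprojection error with optional priors on dimensions, depth, and pose.
Initialize the geometric optimization with the network-predicted priors so that Gauss-Newton/Levenberg-Marquardt in g2o converges quickly.
Train the keypoint heatmaps with a focal loss and the dimensions, depth, and offsets with regression losses, without external supervision data.
Optimize projection consistency and prior information jointly in a single energy function to improve accuracy and speed.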
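
The keypoint-to-box step described above can be illustrated with a small least-squares fit. The following is a minimal sketch, not the authors' implementation: it swaps g2o for SciPy's `least_squares` with Levenberg-Marquardt, restricts rotation to a single yaw angle, places the object origin at the box center, and uses made-up camera intrinsics and box values. The function names (`box_keypoints_3d`, `project`, `residuals`, `fit_box`) and the prior weight are hypothetical. Only a dimension prior is kept in the energy, because scaling the box and its translation by the same factor leaves every projected keypoint unchanged, so reprojection alone cannot fix the overall scale.

```python
import numpy as np
from scipy.optimize import least_squares


def box_keypoints_3d(dims):
    """Eight corners plus the center of a box in its own frame, shape (9, 3)."""
    h, w, l = dims
    # Assumed object frame: x along length, y along height, z along width,
    # origin at the box center (a simplification chosen for this sketch).
    x = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * l / 2.0
    y = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * h / 2.0
    z = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * w / 2.0
    corners = np.stack([x, y, z], axis=1)
    return np.vstack([corners, np.zeros((1, 3))])


def project(params, K):
    """Project the nine keypoints for params = (h, w, l, tx, ty, tz, yaw)."""
    h, w, l, tx, ty, tz, yaw = params
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotation about the camera y axis only (yaw); an assumption of this sketch.
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    pts_cam = box_keypoints_3d((h, w, l)) @ R.T + np.array([tx, ty, tz])
    uvw = pts_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]  # perspective division, shape (9, 2)


def residuals(params, K, kps_2d, dims_prior, prior_weight):
    """18 reprojection residuals (pixels) plus 3 weighted dimension-prior terms."""
    reproj = (project(params, K) - kps_2d).ravel()
    prior = prior_weight * (params[:3] - dims_prior)
    return np.concatenate([reproj, prior])


def fit_box(kps_2d, K, init, dims_prior, prior_weight=1.0):
    """Refine (h, w, l, tx, ty, tz, yaw) from nine 2D keypoints."""
    result = least_squares(
        residuals,
        np.asarray(init, dtype=float),
        args=(K, kps_2d, np.asarray(dims_prior, dtype=float), prior_weight),
        method="lm",  # Levenberg-Marquardt (MINPACK), standing in for g2o
    )
    return result.x


# Hypothetical intrinsics and ground-truth box, chosen only for illustration.
K = np.array([[720.0, 0.0, 620.0], [0.0, 720.0, 190.0], [0.0, 0.0, 1.0]])
truth = np.array([1.5, 1.6, 4.0, 2.0, 1.5, 20.0, 0.3])
keypoints = project(truth, K)  # noise-free stand-in for network output

# A perturbed starting point plays the role of the network-predicted priors.
init = truth + np.array([0.0, 0.0, 0.0, 0.3, 0.1, 1.5, 0.15])
estimate = fit_box(keypoints, K, init, dims_prior=truth[:3])
print(np.round(estimate, 3))
```

In this noise-free setup the fit should return values close to `truth`. With noisy keypoints, the least-squares fit averages over the 18 reprojection residuals while the dimension prior anchors the scale.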

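Experimental Results

Research Questions

RQ1: Can a keypoint-based representation of a monocular RGB image recover accurate 3D bounding boxes using only perspective geometry, without external depth data?
RQ2: On KITTI, does a one-stage keypoint-based detector combined with a geometric optimization stage match or surpass image-based 3D detectors while running in real time?
RQ3: How do the optional priors (dimensions, orientation, depth) and keypoint offsets affect 3D detection accuracy and inference speed?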

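Key Findings

The method achieves real-time performance on KITTI using RGB images only.
Predicting nine 2D keypoints (the eight 3D box vertices plus the center) provides 18 geometric constraints, enough to recover the 3D properties.
The single-stage keypoint network with KFPN, combined with a geometric reprojection energy function, outperforms many image-based methods in AP3D and APBEV at similar speed.
Adding dimension, pose, and depth priors together with keypoint offsets improves accuracy while keeping inference fast, because the optimization starts from a good initialization.
KFPN raises 3D AP on Easy/Moderate/Hard with only a modest change in runtime.
Compared with stereo- and LiDAR-based methods, RTM3D runs significantly faster while offering 3D detection accuracy that is competitive among monocular methods.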
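
The constraint count in the findings above can be written out explicitly. The tally below is a back-of-envelope sketch; the split of the unknowns into three dimensions, three translation components, and a single yaw angle is an assumption made here for illustration, not something the summary states.

```latex
\begin{align*}
\text{2D box:}\quad & 4 \text{ edges} \times 1 = 4 \text{ constraints},\\
\text{nine keypoints:}\quad & 9 \text{ points} \times 2 \text{ image coordinates} = 18 \text{ constraints},\\
\text{unknowns (assumed):}\quad & \underbrace{3}_{h,\,w,\,l} + \underbrace{3}_{t_x,\,t_y,\,t_z} + \underbrace{1}_{\text{yaw}} = 7.
\end{align*}
```

With 18 equations against 7 assumed unknowns the system is overdetermined, which is consistent with the abstract's claim that the object properties can be estimated stably from noisy keypoints.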


This review was generated by AI and reviewed by human editors.