QUICK REVIEW

[Paper Review] BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth

Mahdi Rad, Vincent Lepetit|arXiv (Cornell University)|Mar 31, 2017

Advanced Neural Network Applications|References: 15|Cited by: 71

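One-Sentence Summary

BB8 estimates 3D object poses from color images alone by applying a holistic CNN to segmented objects to predict the 2D projections of their 3D bounding box corners, and it handles symmetric objects by restricting the training pose range and classifying the pose range at run time; it raises the state of the art on LINEMOD from 73.7% to 89.3% of correctly registered RGB frames and reaches 54% on the Pose 6D criterion on several T-LESS sequences, against 67% for the state of the art that uses both color and depth.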

ABSTRACT

We introduce a novel method for 3D object detection and pose estimation from color images only. We first use segmentation to detect the objects of interest in 2D even in presence of partial occlusions and cluttered background. By contrast with recent patch-based methods, we rely on a holistic approach: We apply to the detected objects a Convolutional Neural Network (CNN) trained to predict their 3D poses in the form of 2D projections of the corners of their 3D bounding boxes. This, however, is not sufficient for handling objects from the recent T-LESS dataset: These objects exhibit an axis of rotational symmetry, and the similarity of two images of such an object under two different poses makes training the CNN challenging. We solve this problem by restricting the range of poses used for training, and by introducing a classifier to identify the range of a pose at run-time before estimating it. We also use an optional additional step that refines the predicted poses. We improve the state-of-the-art on the LINEMOD dataset from 73.7% to 89.3% of correctly registered RGB frames. We are also the first to report results on the Occlusion dataset using color images only. We obtain 54% of frames passing the Pose 6D criterion on average on several sequences of the T-LESS dataset, compared to the 67% of the state-of-the-art on the same sequences which uses both color and depth. The full approach is also scalable, as a single network can be trained for multiple objects simultaneously.

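Research Motivation and Goals

Develop a 3D object pose estimation method that works reliably from RGB images alone, including on challenging symmetric objects under partial occlusion.
Address the difficulty of training a CNN on symmetric objects, where two different poses can produce nearly identical images and the pose becomes ambiguous.
Improve robustness and accuracy in cluttered scenes and under occlusion without relying on a depth sensor.
Make the approach scalable by training a single network for multiple objects simultaneously.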

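Proposed Method

Segmentation detects the objects of interest in 2D, even under partial occlusion and against cluttered backgrounds.
A holistic CNN, applied to each detected object, predicts the 2D projections of the corners of its 3D bounding box.
To handle rotational symmetry, the range of poses used for training is restricted, which reduces ambiguity between similar-looking poses.
At run time, a classifier identifies the range a pose falls in before the pose itself is estimated.
An optional additional step refines the predicted poses.
A single network can be trained for multiple objects simultaneously.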
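The pose representation above can be illustrated with a minimal sketch, assuming a pinhole camera and a box centred at the object origin. The function names, box size, pose, and intrinsics are hypothetical values chosen for illustration and do not come from the paper. The sketch shows how a known pose maps the 8 bounding box corners to the 16 image coordinates that a network of this kind would be trained to predict.

```python
from itertools import product


def bounding_box_corners(half_extents):
    """Return the 8 corners of a 3D box centred at the object origin."""
    hx, hy, hz = half_extents
    return [(sx * hx, sy * hy, sz * hz)
            for sx, sy, sz in product((-1, 1), repeat=3)]


def project_corners(corners, R, t, fx, fy, cx, cy):
    """Project object-frame 3D points to pixels with pose (R, t) and a pinhole camera."""
    pixels = []
    for X in corners:
        # Camera-frame point: Xc = R * X + t
        xc, yc, zc = (sum(R[i][j] * X[j] for j in range(3)) + t[i]
                      for i in range(3))
        # Perspective division followed by the intrinsic parameters
        pixels.append((fx * xc / zc + cx, fy * yc / zc + cy))
    return pixels


# Hypothetical object size (metres), pose, and camera intrinsics.
corners = bounding_box_corners((0.05, 0.04, 0.03))
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = (0.0, 0.0, 0.5)
target = project_corners(corners, R_identity, t,
                         fx=600.0, fy=600.0, cx=320.0, cy=240.0)

# Flatten to the 16-value vector (u1, v1, ..., u8, v8) used as a regression target.
flat_target = [value for uv in target for value in uv]
```

Going the other way, from predicted 2D corners and the known 3D corners back to a pose, is a standard 2D-3D correspondence problem that a PnP solver can handle.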

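Experimental Results

Research Questions

RQ1: Can a CNN-based method estimate accurate 3D poses of highly symmetric, occluded objects from RGB images alone?
RQ2: How can the pose ambiguity caused by rotational symmetry be mitigated during training and inference?
RQ3: How much can performance improve on benchmarks such as T-LESS and LINEMOD without using depth data?
RQ4: Can a single network be trained to handle multiple objects simultaneously while remaining accurate and robust?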

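Key Findings

On the LINEMOD dataset, the method correctly registers 89.3% of RGB frames, up from 73.7% for the previous state of the art (a gain of 15.6 percentage points).
The authors are the first to report results on the Occlusion dataset using color images only.
On several sequences of the T-LESS dataset, BB8 reaches 54% of frames passing the Pose 6D criterion on average, compared with 67% for the state of the art on the same sequences, which uses both color and depth.
The approach is scalable: a single network can be trained for multiple objects simultaneously.
Restricting the training pose range and classifying the pose range at run time makes it possible to train the CNN on objects with an axis of rotational symmetry.
An optional additional step refines the predicted poses.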
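The symmetry handling can be sketched as follows. This is an illustrative assumption only: the summary does not say how the pose ranges are defined, so the reduction modulo a symmetry angle `alpha`, the split into `n_ranges` equal sub-ranges, and both function names are hypothetical. The idea shown is that poses differing by the symmetry angle look alike, so training can use angles reduced to one period, and a range label over that period can serve as the target of a run-time classifier.

```python
import math


def restrict_angle(theta, alpha):
    """Map a rotation angle about the symmetry axis into [0, alpha)."""
    return theta % alpha


def range_label(theta, alpha, n_ranges=2):
    """Index of the equal sub-range of [0, alpha) containing the reduced angle."""
    reduced = restrict_angle(theta, alpha)
    # min() guards against floating-point rounding at the upper boundary
    return min(int(reduced / (alpha / n_ranges)), n_ranges - 1)


# Hypothetical object whose appearance repeats every 180 degrees.
alpha = math.pi
reduced = restrict_angle(math.radians(200.0), alpha)      # same view as 20 degrees
label_low = range_label(math.radians(200.0), alpha)       # falls in the first half
label_high = range_label(math.radians(100.0), alpha)      # falls in the second half
```

In a pipeline of this shape, the classifier output would select which pose range applies before the pose is regressed within that range.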

This review was generated by AI and checked by a human editor.