QUICK REVIEW

[Paper Review] PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes

Xiang Yu, Tanner Schmidt | arXiv (Cornell University) | Nov 1, 2017

Human Pose and Action Recognition | References: 34 | Cited by: 130

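One-Sentence Summary

PoseCNN is a CNN that estimates 6D object pose by decoupling 3D translation (recovered from 2D center localization plus center depth) from 3D rotation (regressed as a quaternion). It uses ShapeMatch-Loss to handle symmetric objects and is evaluated on the YCB-Video and OccludedLINEMOD datasets.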

ABSTRACT

Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provide accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.

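Motivation and Goals

Achieve robust 6D pose estimation under clutter and occlusion without relying heavily on depth data.
Develop an end-to-end CNN that handles translation and rotation estimation separately.
Handle symmetric objects through a dedicated loss function (ShapeMatch-Loss).
Provide a large-scale RGB-D video dataset (YCB-Video) with 6D pose annotations for 21 objects.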

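Proposed Method

A two-stage CNN backbone whose features are shared across tasks.
Per-pixel semantic labeling identifies object classes and enables center voting.
2D object center localization: regress a unit direction toward the center at every pixel, then find the 2D center with a Hough voting layer.
3D translation estimation: combine the 2D center location with the predicted center distance (depth) to recover T.
3D rotation: regress a per-class quaternion from the object's bounding-box features, training with PoseLoss for asymmetric objects and ShapeMatch-Loss for symmetric objects.
ICP refinement using depth data, when available.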
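
The center-voting and translation steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' Hough voting layer: the function names `vote_center` and `recover_translation`, the ray sampling step, the camera intrinsics, and the sample pixels are all assumptions made for the example. The translation step assumes a standard pinhole camera with focal lengths (fx, fy) and principal point (px, py), under which a 2D center (cx, cy) at depth Tz back-projects to Tx = (cx - px) * Tz / fx and Ty = (cy - py) * Tz / fy.

```python
import numpy as np

def vote_center(pixels, directions, height, width, step=0.5):
    """Simplified Hough voting: each pixel casts votes along its predicted ray.

    pixels:     (N, 2) array of (x, y) image coordinates labeled as the object.
    directions: (N, 2) array of unit vectors pointing toward the object center.
    Returns the (x, y) cell that collects the most votes.
    """
    pixels = np.asarray(pixels, dtype=float)
    directions = np.asarray(directions, dtype=float)
    acc = np.zeros((height, width), dtype=np.int64)
    # Sample every ray at fixed intervals of `step` pixels.
    ts = np.arange(0.0, float(height + width), step)
    for (x, y), (dx, dy) in zip(pixels, directions):
        xs = np.rint(x + ts * dx).astype(int)
        ys = np.rint(y + ts * dy).astype(int)
        inside = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
        # A pixel votes at most once per cell its ray crosses.
        cells = np.unique(np.stack([ys[inside], xs[inside]], axis=1), axis=0)
        acc[cells[:, 0], cells[:, 1]] += 1
    cy, cx = np.unravel_index(np.argmax(acc), acc.shape)
    return float(cx), float(cy)

def recover_translation(center, depth, fx, fy, px, py):
    """Back-project a 2D center and its depth to a 3D translation (pinhole model)."""
    cx, cy = center
    return np.array([(cx - px) * depth / fx, (cy - py) * depth / fy, depth])

# Hypothetical intrinsics and network outputs, for illustration only.
fx, fy, px, py = 1000.0, 1000.0, 320.0, 240.0
object_pixels = np.array([[300, 250], [340, 250], [320, 230], [320, 270]])
true_center = np.array([320.0, 250.0])
offsets = true_center - object_pixels
unit_dirs = offsets / np.linalg.norm(offsets, axis=1, keepdims=True)

center = vote_center(object_pixels, unit_dirs, height=480, width=640)
T = recover_translation(center, depth=0.8, fx=fx, fy=fy, px=px, py=py)
print(center, T)
```

Because every visible pixel votes, the center can still be located when part of the object is hidden, even if the center itself is occluded. That is the intuition behind the robustness to occlusion claimed for this design.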

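Experimental Results

Research Questions

RQ1: Can a CNN jointly perform semantic labeling, 2D center voting, and 3D pose regression to achieve accurate 6D pose estimation in cluttered scenes?
RQ2: How can symmetry be handled effectively in rotation regression without explicitly enumerating the symmetric orientations?
RQ3: Is translation estimation based on center voting more robust to occlusion than direct 3D coordinate regression?
RQ4: How does PoseCNN perform with color-only input versus RGB-D input on challenging datasets such as OccludedLINEMOD and YCB-Video?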

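Key Findings

PoseCNN achieves strong 6D pose estimation from color images alone and outperforms the 3D coordinate regression baseline on YCB-Video.
Adding depth through ICP refinement significantly improves accuracy, often surpassing RGB-D baselines.
ShapeMatch-Loss handles symmetric objects effectively, improving pose estimation for Eggbox and Glue in OccludedLINEMOD.
On OccludedLINEMOD, PoseCNN with ICP outperforms several recent methods that use RGB-D input.
The YCB-Video dataset (21 objects, 133,827 frames) supports training and evaluation under occlusion and symmetry.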
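
The symmetry finding is easier to see with the two rotation losses side by side. The sketch below follows the loss definitions in the PoseCNN paper: PoseLoss averages the squared distance between corresponding model points under the predicted and ground-truth rotations, while ShapeMatch-Loss matches each predicted point to its closest ground-truth point. The function names, the (w, x, y, z) quaternion order, and the four-point toy model are assumptions made for this example, and the brute-force nearest-neighbor search is chosen for brevity rather than speed.

```python
import numpy as np

def quat_to_rot(q):
    """Convert a quaternion in (w, x, y, z) order to a 3x3 rotation matrix."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def pose_loss(q_pred, q_true, points):
    """PoseLoss: mean squared distance between corresponding rotated points, halved."""
    a = points @ quat_to_rot(q_pred).T
    b = points @ quat_to_rot(q_true).T
    return np.sum((a - b) ** 2) / (2 * len(points))

def shape_match_loss(q_pred, q_true, points):
    """ShapeMatch-Loss: each predicted point is matched to its closest true point."""
    a = points @ quat_to_rot(q_pred).T
    b = points @ quat_to_rot(q_true).T
    # Pairwise squared distances, shape (m, m); brute force for clarity.
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
    return np.sum(d2.min(axis=1)) / (2 * len(points))

# Toy model: four points that map onto themselves under a 90-degree turn about z.
square = np.array([[1.0, 0, 0], [0, 1.0, 0], [-1.0, 0, 0], [0, -1.0, 0]])
q_true = np.array([1.0, 0.0, 0.0, 0.0])                        # identity
q_pred = np.array([np.cos(np.pi / 4), 0, 0, np.sin(np.pi / 4)])  # 90 degrees about z

print(pose_loss(q_pred, q_true, square))         # penalizes the equivalent pose
print(shape_match_loss(q_pred, q_true, square))  # close to zero
```

For this toy model the predicted rotation is visually indistinguishable from the ground truth, yet PoseLoss evaluates to about 1.0 while ShapeMatch-Loss is essentially zero. Training on the latter avoids penalizing a network for choosing any one of several equivalent orientations.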


This review was generated by AI and reviewed by human editors.