QUICK REVIEW

[Paper Review] DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion

Chen Wang, Danfei Xu|arXiv (Cornell University)|Jan 15, 2019

Robot Manipulation and Learning | References: 43 | Cited by: 93

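ONE-SENTENCE SUMMARY

DenseFusion fuses RGB-D features densely at each pixel and adds end-to-end iterative refinement to estimate the 6D pose of known objects, achieving state-of-the-art results and real-time performance on YCB-Video and LineMOD. It outperforms PoseCNN+ICP by 3.5% on ADD-S<2cm and runs roughly 200 times faster.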

ABSTRACT

A key technical challenge in performing 6D object pose estimation from RGB-D image is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performances in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embedding, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches in two datasets, YCB-Video and LineMOD. We also deploy our proposed method to a real robot to grasp and manipulate objects based on the estimated pose.

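MOTIVATION AND OBJECTIVES

Advance robust RGB-D 6D pose estimation in cluttered scenes and under occlusion.
Exploit color and depth jointly through pixel-level fusion, preserving both local geometry and appearance.
Remove the dependence on slow post-processing by integrating end-to-end iterative refinement.
Demonstrate state-of-the-art accuracy on the YCB-Video and LineMOD datasets.
Demonstrate the feasibility of real-time robotic grasping using the estimated poses.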

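PROPOSED METHOD

Process RGB and depth separately to extract dense per-pixel color and geometric embeddings.
Convert the segmented depth pixels into a 3D point cloud and apply a PointNet-like geometric embedding network.
Fuse color and geometry at each pixel with a dense fusion network, producing per-pixel pose hypotheses and confidence scores.
Train with a multi-term objective that weights the per-pixel pose loss by the learned per-pixel confidence and includes a confidence regularization term.
Aggregate the per-pixel predictions by taking the highest-confidence pose as the final estimate.
Add an iterative, differentiable pose refinement module that predicts a pose residual conditioned on the previously estimated pose, allowing multiple refinement iterations.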
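The non-learned parts of these steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the default weight `w`, and the left-composition convention for residual poses are all choices made for this example. The embedding, fusion, and refinement networks themselves are omitted, so per-pixel losses, confidences, and residual poses are passed in as plain inputs.

```python
import math

def backproject(depth, mask, fx, fy, cx, cy):
    """Lift masked depth pixels to 3D points with a pinhole camera model."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if mask[v][u] and z > 0:  # skip unsegmented or invalid pixels
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

def confidence_weighted_loss(pose_losses, confidences, w=0.015):
    """Average of L_i * c_i - w * log(c_i) over all per-pixel predictions.

    The -w * log(c_i) term is the regularizer: it penalizes low confidence,
    so the network cannot shrink the loss by setting every c_i near zero.
    The default w is an illustrative value.
    """
    terms = [l * c - w * math.log(c) for l, c in zip(pose_losses, confidences)]
    return sum(terms) / len(terms)

def select_pose(poses, confidences):
    """Return the per-pixel pose hypothesis with the highest confidence."""
    best = max(range(len(poses)), key=lambda i: confidences[i])
    return poses[best]

def matvec(R, p):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [sum(R[i][k] * p[k] for k in range(3)) for i in range(3)]

def matmul(A, B):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply_residual(pose, residual):
    """Left-compose a residual pose (dR, dt) onto the current estimate (R, t)."""
    (R, t), (dR, dt) = pose, residual
    rotated_t = matvec(dR, t)
    return matmul(dR, R), [rotated_t[i] + dt[i] for i in range(3)]
```

In a refinement loop, `apply_residual` would be called once per iteration with the residual predicted from the current estimate. Because each step only composes rigid transforms, the loop stays differentiable when written in an autodiff framework.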

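EXPERIMENTAL RESULTS

Research Questions

RQ1: Does dense per-pixel RGB-D feature fusion handle occlusion more robustly than global fusion methods?
RQ2: Can end-to-end differentiable iterative refinement improve 6D pose accuracy without slow post-processing?
RQ3: Can the method run inference in real time in cluttered scenes and transfer to real robotic grasping?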

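Key Findings

DenseFusion's per-pixel dense fusion substantially outperforms baselines that fuse by simple concatenation (such as PointFusion).
The iterative refinement module improves pose accuracy, especially on texture-less symmetric objects (such as the bowl and banana).
The method is robust to heavy occlusion: performance degrades only slightly as occlusion increases, and it outperforms the baselines under occlusion.
On YCB-Video, the iterative variant achieves the best ADD-S performance, exceeding PoseCNN+ICP by 3.5% on ADD-S<2cm while running in real time (about 16 FPS).
On LineMOD, the method surpasses prior RGB methods that use depth refinement, and refinement adds a further accuracy gain (about 8%) after two iterations.
In a robotic grasping experiment, using the estimated poses yields a 73% success rate over 60 grasp attempts.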
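For readers unfamiliar with the metrics quoted above, the following sketch shows how ADD and ADD-S are commonly computed from a set of model points and two poses. It assumes points and translations are in meters (so the 2 cm threshold is 0.02), uses brute-force nearest-neighbor search, and the function names are hypothetical. It is an illustration of the metric definitions, not the evaluation code behind the reported numbers.

```python
import math

def transform(R, t, p):
    """Apply the rigid transform R @ p + t to a single 3D point."""
    return tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))

def add(model_points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding points under the two poses."""
    dists = [math.dist(transform(R_gt, t_gt, p), transform(R_pred, t_pred, p))
             for p in model_points]
    return sum(dists) / len(dists)

def add_s(model_points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each ground-truth-posed point to the
    closest predicted-posed point, which tolerates object symmetries."""
    gt = [transform(R_gt, t_gt, p) for p in model_points]
    pred = [transform(R_pred, t_pred, p) for p in model_points]
    return sum(min(math.dist(g, q) for q in pred) for g in gt) / len(gt)

# A two-point "object" that looks identical after a 180-degree turn about z.
model = [(1, 0, 0), (-1, 0, 0)]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rot_z_180 = [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]
zero = [0, 0, 0]

add_error = add(model, identity, zero, rot_z_180, zero)
add_s_error = add_s(model, identity, zero, rot_z_180, zero)
```

In this toy case the flipped pose swaps the two points, so ADD penalizes it heavily while ADD-S treats it as a perfect match. That difference is why ADD-S is the metric of choice for symmetric objects.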


This review was generated by AI and reviewed by a human editor.