QUICK REVIEW

[Paper Review] Category Level Object Pose Estimation via Neural Analysis-by-Synthesis

Chen Xu, Zijian Dong|arXiv (Cornell University)|Jan 1, 2020

Advanced Vision and Imaging|References: 57|Cited by: 3

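One-Sentence Summary

This paper proposes a neural analysis-by-synthesis framework for category-level 6DoF object pose estimation that does not require a CAD model for each object instance. A differentiable neural image synthesis network is trained to generate images from pose, shape, and appearance codes, which allows those codes to be optimized by gradient descent on a perceptual loss; the paper reports state-of-the-art accuracy on RGB-only and RGB-D benchmarks.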

ABSTRACT

Many object pose estimation algorithms rely on the analysis-by-synthesis framework which requires explicit representations of individual object instances. In this paper we combine a gradient-based fitting procedure with a parametric neural image synthesis module that is capable of implicitly representing the appearance, shape and pose of entire object categories, thus rendering the need for explicit CAD models per object instance unnecessary. The image synthesis network is designed to efficiently span the pose configuration space so that model capacity can be used to capture the shape and local appearance (i.e., texture) variations jointly. At inference time the synthesized images are compared to the target via an appearance based loss and the error signal is backpropagated through the network to the input parameters. Keeping the network parameters fixed, this allows for iterative optimization of the object pose, shape and appearance in a joint manner and we experimentally show that the method can recover orientation of objects with high accuracy from 2D images alone. When provided with depth measurements, to overcome scale ambiguities, the method can accurately recover the full 6DOF pose successfully.
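
The inference loop described in the abstract can be illustrated with a deliberately tiny sketch. It assumes a toy `generate` function with frozen `PHASES` standing in for the trained synthesis network, a scalar `theta` for pose, a scalar `a` for appearance, and a plain squared-error loss in place of the appearance-based loss. All of these names and simplifications are hypothetical; this is not the paper's architecture or loss.

```python
import math

# Hypothetical frozen "generator weights": K evenly spaced phases.
K = 8
PHASES = [2 * math.pi * k / K for k in range(K)]

def generate(theta, a):
    """Toy stand-in for the synthesis network: (pose, appearance) -> K 'pixels'."""
    return [a * math.cos(theta - p) for p in PHASES]

def loss_and_grads(theta, a, target):
    """Squared error and its analytic gradients w.r.t. the inputs only."""
    loss = g_theta = g_a = 0.0
    for p, t in zip(PHASES, target):
        r = a * math.cos(theta - p) - t              # residual for this pixel
        loss += r * r
        g_theta += 2 * r * (-a * math.sin(theta - p))
        g_a += 2 * r * math.cos(theta - p)
    return loss, g_theta, g_a

def fit(target, theta=0.0, a=1.0, lr=0.02, steps=300):
    """Gradient descent on (theta, a); PHASES are never updated."""
    for _ in range(steps):
        _, g_theta, g_a = loss_and_grads(theta, a, target)
        theta -= lr * g_theta
        a -= lr * g_a
    return theta, a

# Synthesize a target with known parameters, then recover them from the image.
target = generate(0.9, 1.5)
theta_hat, a_hat = fit(target)
final_loss = loss_and_grads(theta_hat, a_hat, target)[0]
```

The point of the sketch is the division of roles: the generator's parameters stay fixed, and only its inputs are updated from the error signal. As with any gradient-based fit, the result depends on the starting point; the initial values here were chosen so that the toy problem converges.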

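Motivation and Objectives

Remove the dependence of existing 6DoF pose estimation methods on an explicit 3D CAD model for each object instance.
Achieve accurate category-level pose estimation from a single RGB or RGB-D image, without access to per-instance models at test time.
Develop a neural image synthesis module that implicitly represents variations in shape, appearance, and pose across an entire object category.
Integrate the synthesis module into a gradient-based optimization framework that jointly recovers pose, shape, and appearance parameters.
Demonstrate robustness to domain-shift factors such as lighting, occlusion, and detection errors without using data augmentation during training.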

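Proposed Method

Train a deep neural network that generates 2D images of an object category from a 3D pose, a shape code, and an appearance code.
Model the shape and pose space with a 3D conditional variational autoencoder (VAE) combined with a 3D voxel volume, so that pose configurations can be traversed continuously and efficiently.
Condition image generation on latent codes that an encoder network extracts from the input image.
Iteratively optimize the pose, shape, and appearance parameters by backpropagating perceptual-loss gradients through the fixed network.
Build the perceptual loss from features of a pretrained VGG network to encourage semantic alignment rather than pixel-level similarity.
In the RGB-D setting, incorporate depth measurements to resolve the scale ambiguity of RGB-only estimation and recover the full 6DoF pose (3D translation and 3D rotation).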
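
The scale ambiguity mentioned in the last item follows from the pinhole projection model: scaling an object and its distance by the same factor leaves the image unchanged. The sketch below assumes an ideal pinhole camera with made-up intrinsics (`FX`, `FY`, `CX`, `CY`) and two hypothetical helpers, `project` and `backproject`. It illustrates the geometry only, not how the paper incorporates depth into its optimization.

```python
# Assumed camera intrinsics in pixels (illustrative values, not from the paper).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

def project(point):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to a pixel (u, v)."""
    x, y, z = point
    return (FX * x / z + CX, FY * y / z + CY)

def backproject(u, v, depth):
    """Invert the projection once a metric depth is known for the pixel."""
    return ((u - CX) * depth / FX, (v - CY) * depth / FY, depth)

near_small = (0.1, -0.05, 1.0)   # point on a small object, 1 m away
far_large = (0.2, -0.10, 2.0)    # twice the size, twice the distance

# Both points land on the same pixel, so RGB alone cannot tell them apart.
pix_near = project(near_small)
pix_far = project(far_large)

# A depth reading removes the ambiguity and yields the 3D position.
rec_near = backproject(*pix_near, depth=1.0)
rec_far = backproject(*pix_far, depth=2.0)
```

This is why orientation can be estimated from a 2D image alone while the translation component of the 6DoF pose needs a metric cue such as depth.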

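Experimental Results

Research Questions

RQ1: Can a neural image synthesis module implicitly represent an entire object category, including joint variations in shape, appearance, and pose, and thereby remove the need for explicit CAD models?
RQ2: Can gradient-based optimization through a differentiable neural renderer estimate 6DoF pose accurately from a single RGB or RGB-D image?
RQ3: How does the method compare with state-of-the-art approaches in pose accuracy and robustness, particularly against RGB-only and RGB-D baselines?
RQ4: How well does the method generalize to unseen object instances and to domain-shift factors such as lighting, occlusion, and detection errors?
RQ5: How do different loss functions (perceptual, L1, L2, SSIM) and regularization affect optimization stability and final pose accuracy?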

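Key Findings

Using RGB images alone, the method achieves an AP60 of 97.1% on the YCB dataset, outperforming strong RGB-D baselines in some cases.
With RGB-D input, the method accurately recovers the full 6DoF pose, resolving the scale ambiguity inherent in RGB-only estimation.
The perceptual loss outperforms the L1, L2, and SSIM losses, achieving the highest AP60 (97.1%) and the lowest rotation error, which is attributed to better semantic alignment.
Ablation studies show that removing the VAE or the 3D voxel volume either prevents the generation of novel samples or significantly reduces pose accuracy, underlining that both components are necessary.
Even without data augmentation, the method maintains low error under domain shift (e.g., lighting, occlusion, detection errors), demonstrating strong generalization.
Owing to its generative modeling of shape and appearance, the method significantly outperforms discriminative pose-regression baselines, especially under challenging conditions.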
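
To make the loss comparison in the findings above concrete, the sketch below contrasts pixel-space L1 and L2 losses with a feature-space loss. It assumes two hand-picked 1D filters (`KERNELS`) as a stand-in for pretrained VGG features and uses 1D lists as "images"; the filters, function names, and test signals are all hypothetical and much simpler than what the paper uses. SSIM is omitted for brevity.

```python
# Hypothetical fixed 1D filters standing in for a pretrained feature extractor.
KERNELS = [(1 / 3, 1 / 3, 1 / 3), (-1.0, 0.0, 1.0)]  # local average, edge detector

def features(img):
    """Slide each fixed filter over the signal (no padding), then apply ReLU."""
    out = []
    for kern in KERNELS:
        for i in range(len(img) - len(kern) + 1):
            resp = sum(w * img[i + j] for j, w in enumerate(kern))
            out.append(max(resp, 0.0))
    return out

def l1_loss(a, b):
    """Mean absolute difference between raw pixel values."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def l2_loss(a, b):
    """Mean squared difference between raw pixel values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def perceptual_loss(a, b):
    """Mean squared distance between fixed-feature responses, not raw pixels."""
    return l2_loss(features(a), features(b))

target = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]   # same shape, moved by one pixel
blank = [0.0] * 6

# For each loss: (target vs itself, target vs shifted, target vs blank).
losses = {
    name: (fn(target, target), fn(target, shifted), fn(target, blank))
    for name, fn in [("l1", l1_loss), ("l2", l2_loss), ("perceptual", perceptual_loss)]
}
```

The only structural difference between `l2_loss` and `perceptual_loss` is where the comparison happens: the latter first passes both images through a fixed extractor, so its gradients reflect differences in filter responses rather than in individual pixels. This toy makes no claim about which loss performs better; that comparison comes from the paper's experiments.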

This review was generated by AI and reviewed by human editors.