QUICK REVIEW

[Paper Review] Multi-view Convolutional Neural Networks for 3D Shape Recognition

Hang Su, Subhransu Maji|arXiv (Cornell University)|May 5, 2015

3D Surveying and Cultural Heritage | References: 30 | Cited by: 279

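One-Sentence Summary

The paper proposes a multi-view convolutional neural network (MVCNN) that recognizes 3D shapes from multiple rendered 2D images and outperforms methods built on native 3D representations. By fusing multi-view features into a compact descriptor, MVCNN achieves state-of-the-art performance on 3D shape classification and sketch-based retrieval; even with a single view, accuracy rises from about 77% for prior 3D methods to 85%.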

ABSTRACT

A longstanding question in computer vision concerns the representation of 3D shapes for recognition: should 3D shapes be represented with descriptors operating on their native 3D formats, such as voxel grid or polygon mesh, or can they be effectively represented with view-based descriptors? We address this question in the context of learning to recognize 3D shapes from a collection of their rendered views on 2D images. We first present a standard CNN architecture trained to recognize the shapes' rendered views independently of each other, and show that a 3D shape can be recognized even from a single view at an accuracy far higher than using state-of-the-art 3D shape descriptors. Recognition rates further increase when multiple views of the shapes are provided. In addition, we present a novel CNN architecture that combines information from multiple views of a 3D shape into a single and compact shape descriptor offering even better recognition performance. The same architecture can be applied to accurately recognize human hand-drawn sketches of shapes. We conclude that a collection of 2D views can be highly informative for 3D shape recognition and is amenable to emerging CNN architectures and their derivatives.

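Research Motivation and Objectives

Investigate whether 2D image representations can outperform learning directly on 3D representations for 3D shape recognition.
Develop a deep learning architecture that combines multiple 2D views of a 3D shape into a single, compact, and discriminative descriptor.
Enable accurate sketch-based 3D shape retrieval by leveraging the learned 2D representations.
Explore whether CNNs pretrained on ImageNet can improve generalization on 3D shape recognition tasks.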

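Proposed Method

The method uses a two-stage CNN architecture. First, a CNN processes each 2D view independently to extract view-specific features.
Second, the features are pooled across views and fed into a second CNN, which produces a compact, unified shape descriptor.
The network is trained with a cross-entropy loss over 3D shape categories, with view jittering used for data augmentation during training.
The model is initialized with weights pretrained on ImageNet and then fine-tuned on the 3D shape dataset.
Saliency maps are generated by backpropagating gradients to identify the most informative views and the key regions within each view.
For sketch-based retrieval, the same descriptor is used to match hand-drawn sketches against 3D shapes, with no additional fine-tuning.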
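
The cross-view pooling step can be illustrated with a minimal sketch. The summary does not name the pooling operator, so element-wise max across views is assumed here; the function name `view_pool` and the feature values are hypothetical, and a real model would pool CNN feature maps rather than short lists.

```python
from typing import List, Sequence


def view_pool(view_features: Sequence[Sequence[float]]) -> List[float]:
    """Fuse per-view feature vectors into one descriptor.

    Assumption: element-wise max across views. The result does not
    depend on the order in which the views are supplied.
    """
    if not view_features:
        raise ValueError("need at least one view")
    dim = len(view_features[0])
    if any(len(f) != dim for f in view_features):
        raise ValueError("all views must share one feature dimension")
    # For each feature index, keep the strongest response over all views.
    return [max(f[i] for f in view_features) for i in range(dim)]


# Hypothetical 4-dimensional features for 3 rendered views of one shape.
views = [
    [0.1, 0.9, 0.0, 0.3],
    [0.4, 0.2, 0.0, 0.8],
    [0.2, 0.5, 0.7, 0.1],
]
descriptor = view_pool(views)
print(descriptor)  # [0.4, 0.9, 0.7, 0.8]
```

Because the maximum is taken per feature index, the fused descriptor has the same size as a single view's features regardless of how many views are rendered, which is what keeps the descriptor compact.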

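Experimental Results

Research Questions

RQ1: Can 2D image representations of 3D shapes outperform learning directly on 3D representations for shape recognition?
RQ2: How effective is the multi-view CNN architecture at fusing information from multiple 2D projections into a compact, discriminative shape descriptor?
RQ3: Can the learned descriptor support accurate sketch-based retrieval of 3D shapes?
RQ4: How do view selection and view diversity affect recognition performance?
RQ5: Can a CNN pretrained on ImageNet be effectively fine-tuned for 3D shape recognition using only 2D rendered images?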

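Key Findings

With only a single 2D view, MVCNN reaches 85% top-1 accuracy on ModelNet40, 8 percentage points above the best prior 3D-representation method.
With 12 views, the model reaches 86.4% top-1 accuracy on ModelNet40, significantly outperforming prior 3D CNN models.
Without any fine-tuning on sketches, and using a pretrained VGG-M network, the model achieves 36.1% mAP on sketch-based 3D shape retrieval.
Saliency maps identify the most informative views and discriminative regions, such as the front of a bench or the faucet of a bathtub.
The multi-view CNN outperforms standard jittering-based data augmentation on a sketch recognition benchmark, showing that it is useful beyond 3D shape recognition.
The model generalizes well to real-world 3D objects and video-based reconstruction tasks, suggesting applicability beyond synthetic meshes.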
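
For readers unfamiliar with the mAP figure reported for sketch-based retrieval, the following is a minimal sketch of how a retrieval ranking and its average precision can be computed. Cosine similarity, the 2-dimensional descriptors, the labels, and all function names are illustrative assumptions, not the paper's actual setup.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two descriptors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_gallery(query: Sequence[float],
                 gallery: Sequence[Sequence[float]]) -> List[int]:
    """Return gallery indices sorted from most to least similar."""
    scores = [cosine(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical sketch descriptor and a gallery of three shape descriptors.
query, query_label = [1.0, 0.0], "chair"
gallery = [[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]
labels = ["chair", "table", "chair"]

order = rank_gallery(query, gallery)
relevance = [labels[i] == query_label for i in order]
ap = average_precision(relevance)
print(order, round(ap, 4))  # [0, 1, 2] 0.8333
```

mAP is the mean of this average precision over all queries, so a single poorly ranked query lowers the score even when the others are ranked perfectly.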


This review was generated by AI and checked by a human editor.