QUICK REVIEW

[Paper Review] PVT: Point-Voxel Transformer for 3D Deep Learning

Cheng Zhang, Haocheng Wan | arXiv (Cornell University) | Aug 13, 2021

Human Pose and Action Recognition | References: 56 | Cited by: 24

One-Sentence Summary

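PVT is a 3D deep learning architecture that combines voxel-based and point-based multi-head self-attention to capture both coarse-grained and fine-grained 3D features efficiently. It applies self-attention in voxels for computational efficiency and in points to retain global context, and it uses a cyclic shifted boxing scheme to reduce the cost of attention. On ModelNet40, PVT reaches a state-of-the-art accuracy of 94.0% without voting, with a 7x measured speedup on average over previous Transformer-based models.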

ABSTRACT

In this paper, we present an efficient and high-performance neural architecture, termed Point-Voxel Transformer (PVT), for 3D deep learning, which deeply integrates both 3D voxel-based and point-based self-attention computation to learn more discriminative features from 3D data. Specifically, we conduct multi-head self-attention (MSA) computation in voxels to obtain an efficient learning pattern and coarse-grained local features, while performing self-attention in points to provide finer-grained information about the global context. In addition, to reduce the cost of MSA computation, we design a cyclic shifted boxing scheme that limits the MSA computation to non-overlapping local boxes while preserving cross-box connections. Evaluated on a classification benchmark, our method not only achieves state-of-the-art accuracy of 94.0% (no voting) but also outperforms previous Transformer-based models with a 7x measured speedup on average. On part and semantic segmentation, our model also obtains strong performance (86.5% and 68.2% mIoU, respectively). For the 3D object detection task, we replace the primitives in Frustum PointNet with PVT blocks and achieve an improvement of 8.6% AP.

Motivation and Objectives

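Address the efficiency and representational limits of existing 3D deep learning models that rely on voxels alone or points alone.
Combine the strengths of voxel-based and point-based self-attention to improve feature learning on 3D data.
Reduce the computational cost of multi-head self-attention in 3D Transformers through an optimized spatial partitioning scheme.
Achieve strong performance across several 3D vision benchmarks, covering classification, segmentation, and detection.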

Proposed Method

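PVT performs multi-head self-attention (MSA) in voxels to capture coarse-grained local features at reduced computational cost.
In parallel, it applies self-attention on the raw point cloud to retain fine-grained geometric detail and global context.
A cyclic shifted boxing scheme partitions 3D space into non-overlapping local boxes, which limits the cost of MSA while preserving connections across boxes.
Voxel and point features are fused through a cross-modal attention mechanism to strengthen the feature representation.
For 3D object detection, the primitives in Frustum PointNet are replaced with PVT blocks.
The design yields efficient inference: FLOPs drop substantially and speed improves markedly without sacrificing accuracy.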
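
The box partitioning idea can be illustrated with a small sketch. This is not the authors' implementation: the function names, the unit-cube input range, the grid resolution of 4, the box size of 2, and the shift of half a box are all assumptions chosen for illustration. It shows how two neighbouring voxels that fall in different boxes under the regular partition end up in the same box once the grid is cyclically shifted, which is how cross-box connections can be preserved.

```python
from collections import defaultdict


def voxelize(points, resolution):
    """Map points in the unit cube [0, 1)^3 to integer voxel coordinates (assumed input range)."""
    return [tuple(min(int(c * resolution), resolution - 1) for c in p) for p in points]


def box_index(voxel, box_size, resolution, shift=0):
    """Return the box a voxel falls in after cyclically shifting the grid by `shift` voxels."""
    return tuple(((c + shift) % resolution) // box_size for c in voxel)


def group_by_box(voxels, box_size, resolution, shift=0):
    """Group voxel indices by box; self-attention would be restricted to each group."""
    groups = defaultdict(list)
    for i, v in enumerate(voxels):
        groups[box_index(v, box_size, resolution, shift)].append(i)
    return dict(groups)


# Two neighbouring voxels that straddle a box boundary in a hypothetical 4^3 grid.
voxels = [(1, 0, 0), (2, 0, 0)]
regular = group_by_box(voxels, box_size=2, resolution=4)           # two separate boxes
shifted = group_by_box(voxels, box_size=2, resolution=4, shift=1)  # one shared box
```

Alternating between the regular and the shifted partition in successive layers lets information cross box boundaries while each attention call still sees only one box. A tensor-based version would typically roll the whole voxel grid instead of remapping indices one by one.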

Experimental Results

Research Questions

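RQ1: Does combining voxel-based and point-based self-attention improve 3D feature learning in deep neural networks?
RQ2: How can multi-head self-attention in 3D space be made more efficient without losing global context?
RQ3: How does cyclic shifted spatial boxing affect attention computation and model performance?
RQ4: Can a hybrid voxel-point attention mechanism outperform voxel-only or point-only Transformer models on standard 3D benchmarks?
RQ5: How well does the proposed architecture scale across diverse 3D vision tasks such as classification, segmentation, and detection?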

Key Findings

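PVT achieves a state-of-the-art accuracy of 94.0% on the ModelNet40 classification benchmark without test-time voting.
On the same benchmark, inference is 7x faster on average than previous Transformer-based models.
On part segmentation, PVT reaches 86.5% mIoU, showing strong performance on fine-grained 3D understanding.
On semantic segmentation, the model obtains 68.2% mIoU, indicating robust feature learning in complex scenes.
Replacing the primitives in Frustum PointNet with PVT blocks improves 3D object detection by 8.6% AP.
The cyclic shifted boxing scheme lowers the cost of MSA computation while maintaining model performance by preserving cross-box connections.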
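
To see why restricting attention to boxes lowers the cost, a rough count of pairwise attention scores is enough. The sketch below is illustrative arithmetic only: the grid resolution of 8 and box size of 4 are hypothetical values, not settings reported in the paper, and the count assumes a dense grid and ignores heads, channels, and empty voxels.

```python
def attention_pairs_global(num_tokens):
    """Pairwise scores when every token attends to every other token."""
    return num_tokens ** 2


def attention_pairs_boxed(resolution, box_size):
    """Pairwise scores when attention is limited to non-overlapping cubic boxes."""
    tokens_per_box = box_size ** 3
    num_boxes = (resolution // box_size) ** 3
    return num_boxes * tokens_per_box ** 2


resolution, box_size = 8, 4          # hypothetical grid and box sizes
num_tokens = resolution ** 3         # 512 voxels in a dense grid
global_pairs = attention_pairs_global(num_tokens)
boxed_pairs = attention_pairs_boxed(resolution, box_size)
ratio = global_pairs / boxed_pairs   # equals the number of boxes in this dense case
```

With N tokens split into boxes of M tokens each, the count falls from N^2 to N*M, so the saving grows as the grid gets finer relative to the box size.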

This review was generated by AI and reviewed by human editors.