QUICK REVIEW

[Paper Review] Point-Voxel CNN for Efficient 3D Deep Learning

Zhijian Liu, Haotian Tang | arXiv (Cornell University) | Jul 8, 2019
3D Shape Modeling and Analysis | 49 references | 335 citations
TL;DR

PVCNN combines a low-resolution voxel-based branch with a high-resolution point-based branch to achieve faster, memory-efficient 3D deep learning while maintaining high accuracy.

ABSTRACT

We present Point-Voxel CNN (PVCNN) for efficient, fast 3D deep learning. Previous work processes 3D data using either voxel-based or point-based NN models. However, both approaches are computationally inefficient. The computation cost and memory footprints of the voxel-based models grow cubically with the input resolution, making it memory-prohibitive to scale up the resolution. As for point-based networks, up to 80% of the time is wasted on structuring the sparse data which have rather poor memory locality, not on the actual feature extraction. In this paper, we propose PVCNN that represents the 3D input data in points to reduce the memory consumption, while performing the convolutions in voxels to reduce the irregular, sparse data access and improve the locality. Our PVCNN model is both memory and computation efficient. Evaluated on semantic and part segmentation datasets, it achieves much higher accuracy than the voxel-based baseline with 10x GPU memory reduction; it also outperforms the state-of-the-art point-based models with 7x measured speedup on average. Remarkably, the narrower version of PVCNN achieves 2x speedup over PointNet (an extremely efficient model) on part and scene segmentation benchmarks with much higher accuracy. We validate the general effectiveness of PVCNN on 3D object detection: by replacing the primitives in Frustum PointNet with PVConv, it outperforms Frustum PointNet++ by 2.4% mAP on average with 1.5x measured speedup and GPU memory reduction.

Motivation & Objective

  • Motivate the need for efficient 3D deep learning on edge devices due to memory and latency constraints.
  • Propose a hybrid PVConv primitive that fuses voxel-based and point-based processing to reduce memory footprint and improve data locality.
  • Demonstrate that PVCNN achieves higher accuracy with lower memory and latency compared with pure voxel- or point-based models across multiple 3D tasks.
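To make the abstract's point about cubic growth concrete, the short calculation below estimates the size of a single dense voxel feature grid at a few resolutions. The helper name `dense_voxel_bytes`, the 64-channel width, and float32 storage are assumptions chosen for illustration; they are not figures from the paper.

```python
def dense_voxel_bytes(resolution, channels=1, bytes_per_value=4):
    """Bytes for one dense resolution^3 feature grid (float32 by default)."""
    return resolution ** 3 * channels * bytes_per_value


# Assumed setting: 64 feature channels stored as float32.
for r in (32, 64, 128):
    mib = dense_voxel_bytes(r, channels=64) / 2 ** 20
    print(f"resolution {r}: {mib:.0f} MiB per feature grid")
```

Doubling the resolution multiplies the footprint by eight, which is why a voxel branch is only affordable at low resolution.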

Proposed method

  • Introduce Point-Voxel Convolution (PVConv) with two branches: a voxel-based branch for coarse neighborhood aggregation and a high-resolution point-based branch for fine-grained features.
  • Voxel-based branch voxelizes normalized points into low-resolution grids, applies 3D convolutions, and devoxelizes via trilinear interpolation to fuse with point features.
  • Point-based branch processes original points with an MLP to preserve high-resolution, per-point information.
  • Fuse features from both branches via simple addition to obtain final point features.
  • Normalize point coordinates before voxelization, and keep both voxelization and devoxelization differentiable so the whole network can be trained end to end.
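The steps listed above can be sketched in NumPy as a forward pass only. This is an illustrative sketch, not the authors' implementation: the function names (`normalize_coords`, `voxelize`, `conv3d`, `devoxelize`, `pvconv`), the exact normalization formula, the single 3x3x3 convolution, the ReLU activations, and the grid resolution `r=8` are all assumptions made for the example.

```python
import numpy as np


def normalize_coords(coords, eps=1e-8):
    """Center the points and scale them into the unit cube [0, 1]^3 (assumed scheme)."""
    centered = coords - coords.mean(axis=0, keepdims=True)
    radius = np.linalg.norm(centered, axis=1).max()
    return centered / (2.0 * radius + eps) + 0.5


def voxelize(coords01, feats, r):
    """Average the features of all points that fall into each of the r^3 voxels."""
    idx = np.clip(np.floor(coords01 * r).astype(np.int64), 0, r - 1)
    flat = (idx[:, 0] * r + idx[:, 1]) * r + idx[:, 2]
    grid = np.zeros((r ** 3, feats.shape[1]))
    count = np.zeros(r ** 3)
    np.add.at(grid, flat, feats)   # unbuffered scatter-add handles repeated indices
    np.add.at(count, flat, 1.0)
    grid /= np.maximum(count, 1.0)[:, None]
    return grid.reshape(r, r, r, -1)


def conv3d(grid, weight):
    """3x3x3 convolution with zero padding; weight shape is (3, 3, 3, C_in, C_out)."""
    r = grid.shape[0]
    padded = np.pad(grid, ((1, 1), (1, 1), (1, 1), (0, 0)))
    out = np.zeros(grid.shape[:3] + (weight.shape[-1],))
    for dx, dy, dz in np.ndindex(3, 3, 3):
        out += padded[dx:dx + r, dy:dy + r, dz:dz + r] @ weight[dx, dy, dz]
    return out


def devoxelize(grid, coords01):
    """Trilinearly interpolate voxel features back to the point positions."""
    r = grid.shape[0]
    # Voxel centers sit at (i + 0.5) / r, so shift by 0.5 to get a continuous index.
    pos = np.clip(coords01 * r - 0.5, 0.0, r - 1.0)
    lo = np.floor(pos).astype(np.int64)
    hi = np.minimum(lo + 1, r - 1)
    frac = pos - lo
    out = np.zeros((coords01.shape[0], grid.shape[-1]))
    for corner in np.ndindex(2, 2, 2):
        pick_hi = np.array(corner) == 1
        idx = np.where(pick_hi, hi, lo)                        # (N, 3) corner indices
        w = np.where(pick_hi, frac, 1.0 - frac).prod(axis=1)   # (N,) trilinear weights
        out += w[:, None] * grid[idx[:, 0], idx[:, 1], idx[:, 2]]
    return out


def pvconv(coords, feats, conv_w, mlp_w, r=8):
    """Fuse a low-resolution voxel branch and a per-point MLP branch by addition."""
    coords01 = normalize_coords(coords)
    voxels = np.maximum(conv3d(voxelize(coords01, feats, r), conv_w), 0.0)
    voxel_branch = devoxelize(voxels, coords01)
    point_branch = np.maximum(feats @ mlp_w, 0.0)   # shared per-point linear layer + ReLU
    return voxel_branch + point_branch


# Usage with random data: 100 points, 4 input channels, 6 output channels.
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))
features = rng.normal(size=(100, 4))
fused = pvconv(points, features,
               conv_w=rng.normal(size=(3, 3, 3, 4, 6)) * 0.1,
               mlp_w=rng.normal(size=(4, 6)))
print(fused.shape)  # (100, 6)
```

The output keeps one feature vector per input point, so the voxel grid is only a temporary workspace for neighborhood aggregation. A trainable version would express the same steps in an autodiff framework so that gradients flow through the interpolation weights and the averaged voxel features.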

Experimental results

Research questions

  • RQ1: How can 3D data be processed efficiently without sacrificing accuracy on common 3D tasks (segmentation, detection)?
  • RQ2: Does a hybrid voxel-point approach reduce memory footprint and improve data locality compared with pure voxel or pure point methods?
  • RQ3: What is the performance (accuracy, latency, memory) of PVCNN on ShapeNet Part, S3DIS, and KITTI benchmarks?

Key findings

  • PVCNN achieves higher accuracy than voxel baselines with substantially lower GPU memory (about 10x reduction in memory for ShapeNet Part).
  • PVCNN attains about 7x speedup over state-of-the-art point-based models on average across tested tasks.
  • Narrow PVCNN variants reach 2x to 15x speedups over strong baselines (e.g., PointNet, SpiderCNN) with competitive or higher accuracy.
  • On ShapeNet Part, PVCNN variants show favorable accuracy-latency-memory trade-offs, e.g., 1xC variant achieves 86.2 mIoU with 50.7 ms latency and 1.59 GB memory.
  • On S3DIS indoor scene segmentation, PVCNN and PVCNN++ outperform pure point-based models with up to 8x speedup and 3x memory reduction; PVCNN++ surpasses PointCNN with lower latency.
  • For 3D object detection (KITTI), PVCNN variants outperform F-PointNet++ with a 1.5x measured speedup and reduced GPU memory, and the complete PVCNN yields notable mAP improvements.

This review was created by AI and reviewed by human editors.