QUICK REVIEW

[Paper Review] Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds

Bo Yang, Jianan Wang | arXiv (Cornell University) | Jun 4, 2019

3D Shape Modeling and Analysis | 59 references | 208 citations

TL;DR

3D-BoNet directly regresses 3D bounding boxes and per-point masks for all instances in a point cloud in a single-stage, anchor-free framework, achieving state-of-the-art results on ScanNet and S3DIS with high efficiency.

ABSTRACT

We propose a novel, conceptually simple and general framework for instance segmentation on 3D point clouds. Our method, called 3D-BoNet, follows the simple design philosophy of per-point multilayer perceptrons (MLPs). The framework directly regresses 3D bounding boxes for all instances in a point cloud, while simultaneously predicting a point-level mask for each instance. It consists of a backbone network followed by two parallel network branches for 1) bounding box regression and 2) point mask prediction. 3D-BoNet is single-stage, anchor-free and end-to-end trainable. Moreover, it is remarkably computationally efficient as, unlike existing approaches, it does not require any post-processing steps such as non-maximum suppression, feature sampling, clustering or voting. Extensive experiments show that our approach surpasses existing work on both ScanNet and S3DIS datasets while being approximately 10x more computationally efficient. Comprehensive ablation studies demonstrate the effectiveness of our design.

Motivation & Objective

Perform efficient 3D instance segmentation directly on raw point clouds, without heavy post-processing or dense proposals.
Develop a bounding box prediction module that can handle a variable number of instances and unordered outputs.
Enable precise instance segmentation by coupling object boundaries with per-point mask prediction in a unified framework.

Proposed method

Backbone network extracts per-point local features and a global scene feature from the input point cloud.
Bounding box prediction branch regresses a fixed number H of 3D bounding boxes, each with a confidence score, from the global feature.
Bounding box association layer solves a Hungarian assignment to match ground-truth boxes with predictions for supervision.
Multi-criteria loss combines Euclidean box distance, soft IoU (sIoU) on points, and cross-entropy score to supervise box prediction.
Point mask prediction branch fuses box, local, and global features to predict a per-instance per-point binary mask, using focal loss for class imbalance.
End-to-end training with a shared backbone (PointNet++), plus a semantic branch trained with standard cross-entropy.
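
The box association and multi-criteria cost described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the sigmoid-based `soft_inside` probability, the equal weighting of the three criteria, and all function names are hypothetical choices made here, and a brute-force search over permutations stands in for the Hungarian algorithm so that the example needs only the standard library.

```python
import math
from itertools import permutations

def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_inside(point, box, sharpness=10.0):
    """Soft probability that a point lies in an axis-aligned box (min_xyz, max_xyz).
    Assumption: a product of sigmoids, chosen here only for illustration."""
    lo, hi = box
    p = 1.0
    for x, a, b in zip(point, lo, hi):
        p *= _sigmoid(sharpness * (x - a)) * _sigmoid(sharpness * (b - x))
    return p

def hard_inside(point, box):
    """Binary membership of a point in a ground-truth box."""
    lo, hi = box
    return 1.0 if all(a <= x <= b for x, a, b in zip(point, lo, hi)) else 0.0

def pair_cost(pred_box, gt_box, points, eps=1e-6):
    """Sum of three criteria for one (predicted, ground-truth) pair; lower is better."""
    # Criterion 1: mean squared distance between the six box coordinates.
    flat_pred = list(pred_box[0]) + list(pred_box[1])
    flat_gt = list(gt_box[0]) + list(gt_box[1])
    ed = sum((p - g) ** 2 for p, g in zip(flat_pred, flat_gt)) / 6.0
    # Soft membership for the prediction (clamped for the logs), hard for ground truth.
    q = [min(max(soft_inside(pt, pred_box), eps), 1.0 - eps) for pt in points]
    g = [hard_inside(pt, gt_box) for pt in points]
    # Criterion 2: negative soft IoU over the points.
    inter = sum(qi * gi for qi, gi in zip(q, g))
    union = sum(q) + sum(g) - inter
    siou_cost = -inter / (union + eps)
    # Criterion 3: cross-entropy between soft and hard memberships.
    ce = -sum(gi * math.log(qi) + (1.0 - gi) * math.log(1.0 - qi)
              for qi, gi in zip(q, g)) / len(points)
    return ed + siou_cost + ce

def associate(pred_boxes, gt_boxes, points):
    """Return a tuple where entry j is the predicted-box index matched to ground truth j.
    Brute force over permutations: only viable for a handful of boxes."""
    cost = [[pair_cost(pb, gb, points) for gb in gt_boxes] for pb in pred_boxes]
    best, best_total = None, float("inf")
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        total = sum(cost[i][j] for j, i in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    return best
```

With three predicted boxes and two ground-truth boxes, `associate` returns one predicted index per ground-truth box and leaves the remaining prediction unmatched. For a realistic number of boxes H, the same cost matrix could be handed to a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` instead of enumerating permutations.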

Experimental results

Research questions

RQ1: Can a single-stage, anchor-free framework learn accurate 3D bounding boxes for instances directly from point clouds without post-processing?
RQ2: Does combining geometric box supervision with point-wise coverage (sIoU) and box confidence improve instance binding to ground-truth instances?
RQ3: How well does a simple, shared, box-aware per-point mask branch perform for instance segmentation across diverse object categories?
RQ4: What is the computational efficiency gain compared to proposal-based or post-processing-heavy 3D instance segmentation methods?
RQ5: Is the framework capable of generalizing to unseen categories due to a class-agnostic mask branch?

Key findings

3D-BoNet surpasses several baselines on ScanNet v2 in AP at IoU 0.5, while being approximately 10x more computationally efficient.
The bounding box association and multi-criteria loss enable reliable pairing between predicted and ground-truth boxes in a variable-instance setting.
The point-mask branch yields competitive instance-level segmentation by reusing global and local features without RoI pooling.
Ablation studies show the box score branch and the full, multi-criteria loss significantly improve performance over single-criterion or no-box supervision configurations.
On S3DIS, 3D-BoNet achieves higher mean precision and comparable recall relative to PartNet and ASIS baselines, with the full framework providing best performance.
Computation analysis indicates the method operates in O(N) time, with practical GPU times around 20 ms for 4k points, significantly faster than clustering or dense proposal methods.
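
The method summary names focal loss as the mask branch's remedy for class imbalance, since most points in a scene lie outside any single instance. A minimal sketch of binary focal loss follows, assuming the commonly used defaults `alpha=0.25` and `gamma=2.0` from the original focal loss formulation. These values and the function name are assumptions for illustration and may differ from the settings used in the paper.

```python
import math

def binary_focal_loss(probs, labels, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss over per-point mask probabilities.

    probs  : predicted probability that each point belongs to the instance
    labels : 1 if the point belongs to the instance, else 0
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)           # clamp to keep log finite
        p_t = p if y == 1 else 1.0 - p            # probability of the true class
        a_t = alpha if y == 1 else 1.0 - alpha    # class-balancing weight
        # (1 - p_t)^gamma shrinks the contribution of well-classified points.
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)
```

Setting `gamma=0.0` and `alpha=0.5` reduces the expression to half the ordinary binary cross-entropy, which is a convenient sanity check; raising `gamma` shifts the loss toward the hard, misclassified points.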

This review was created by AI and reviewed by human editors.