QUICK REVIEW

[Paper Review] OpenMask3D: Open-Vocabulary 3D Instance Segmentation

Ayça Takmaz, Elisabetta Fedele|arXiv (Cornell University)|Jun 23, 2023

Advanced Neural Network Applications|Cited by 29

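One-Sentence Summary

OpenMask3D predicts class-agnostic 3D instance masks and aggregates CLIP-based image features across multiple views into a single feature per mask; these per-mask features are then matched against open-vocabulary queries, enabling zero-shot open-vocabulary 3D instance segmentation.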

ABSTRACT

We introduce the task of open-vocabulary 3D instance segmentation. Current approaches for 3D instance segmentation can typically only recognize object categories from a pre-defined closed set of classes that are annotated in the training datasets. This results in important limitations for real-world applications where one might need to perform tasks guided by novel, open-vocabulary queries related to a wide variety of objects. Recently, open-vocabulary 3D scene understanding methods have emerged to address this problem by learning queryable features for each point in the scene. While such a representation can be directly employed to perform semantic segmentation, existing methods cannot separate multiple object instances. In this work, we address this limitation, and propose OpenMask3D, which is a zero-shot approach for open-vocabulary 3D instance segmentation. Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings. Experiments and ablation studies on ScanNet200 and Replica show that OpenMask3D outperforms other open-vocabulary methods, especially on the long-tail distribution. Qualitative experiments further showcase OpenMask3D's ability to segment object properties based on free-form queries describing geometry, affordances, and materials.

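Motivation and Objectives

Motivate and define the task of open-vocabulary 3D instance segmentation for scenes that contain novel objects and are queried with free-form text.
Propose a two-stage pipeline that first produces class-agnostic 3D masks and then computes a per-mask feature suited to open-vocabulary queries.
Show that the instance-centric OpenMask3D outperforms existing open-vocabulary methods on long-tail categories while preserving information about long-tail objects.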

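Proposed Method

Use a class-agnostic 3D mask proposal head to obtain binary instance masks from the reconstructed point cloud.
Compute a feature for each mask by selecting the top-k views in which the instance is most visible, then extracting multi-scale CLIP image embeddings from crops around 2D masks refined with SAM.
Aggregate the per-view CLIP embeddings across views into a single per-instance feature in CLIP space, with no fine-tuning.
Retrieve instances by measuring the cosine similarity in CLIP space between each per-instance mask feature and a text or image embedding, which allows open-vocabulary descriptions.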
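The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes per-view visibility counts and per-crop CLIP embeddings have already been computed, it uses simple average pooling of unit-normalized embeddings as the aggregation (the summary does not say which pooling is used), and every function name and default value here (`top_k_views`, `multi_scale_boxes`, `aggregate_mask_feature`, `rank_instances`, `levels=3`, `expand_ratio=0.1`) is hypothetical. The 3-dimensional toy vectors stand in for real CLIP embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def top_k_views(visible_counts, k):
    """Indices of the k views in which the instance has the most visible points."""
    order = sorted(range(len(visible_counts)),
                   key=lambda i: visible_counts[i], reverse=True)
    return order[:k]


def multi_scale_boxes(box, image_size, levels=3, expand_ratio=0.1):
    """Crops around `box`, each level enlarged further and clipped to the image.

    `levels` and `expand_ratio` are illustrative defaults (assumptions).
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    boxes = []
    for level in range(levels):
        dx = level * expand_ratio * (x2 - x1)
        dy = level * expand_ratio * (y2 - y1)
        boxes.append((max(0, x1 - dx), max(0, y1 - dy),
                      min(width, x2 + dx), min(height, y2 + dy)))
    return boxes


def aggregate_mask_feature(crop_embeddings):
    """Average unit-normalized embeddings (all crops, all views) into one feature."""
    unit = [normalize(e) for e in crop_embeddings]
    dim = len(unit[0])
    mean = [sum(e[d] for e in unit) / len(unit) for d in range(dim)]
    return normalize(mean)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_instances(mask_features, query_embedding):
    """Instance ids sorted by cosine similarity to the query, best match first."""
    scores = {i: cosine_similarity(f, query_embedding)
              for i, f in mask_features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: visible-point counts of one instance in five views.
visible_counts = [120, 15, 300, 0, 220]
selected = top_k_views(visible_counts, k=3)

# Three nested crops around a 20x20 box in a 100x100 image.
crops = multi_scale_boxes((40, 40, 60, 60), (100, 100), levels=3, expand_ratio=0.5)

# Toy embeddings standing in for CLIP features of the crops of two instances.
chair_crops = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
lamp_crops = [[0.0, 0.9, 0.2], [0.1, 1.0, 0.0], [0.1, 0.8, 0.1]]
mask_features = {
    "instance_0": aggregate_mask_feature(chair_crops),
    "instance_1": aggregate_mask_feature(lamp_crops),
}

# A real query would come from the CLIP text encoder; this is a toy vector.
query = [0.1, 1.0, 0.1]
ranking = rank_instances(mask_features, query)
```

Because each instance carries one feature in the same space as the query embedding, a new query only requires one text encoding and one similarity computation per mask, with no retraining of the mask proposal stage.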

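Experimental Results

Research Questions

RQ1: Can open-vocabulary 3D instance segmentation recognize and separate object instances beyond a closed label set?
RQ2: Does aggregating multi-view CLIP features for each 3D instance yield discriminative mask features for open-vocabulary queries?
RQ3: How do design choices such as multi-scale crops and 2D mask refinement affect open-vocabulary 3D instance segmentation performance?
RQ4: How well does OpenMask3D generalize to unseen or novel categories and to out-of-distribution data?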

Key Findings

Model	Image Features	AP	AP50	AP25	head AP	common AP	tail AP
Mask3D [58]	-	26.9	36.2	41.4	39.8	21.7	17.9
OpenMask3D (Ours)	CLIP [55]	15.4	19.9	23.1	17.1	14.1	14.9

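OpenMask3D achieves higher AP than other open-vocabulary methods on ScanNet200 and Replica, especially on tail categories.
On ScanNet200, OpenMask3D with CLIP features reaches AP 15.4, AP50 19.9, and AP25 23.1 (head 17.1, common 14.1, tail 14.9).
On Replica, OpenMask3D reaches AP 13.1, AP50 18.4, and AP25 24.2.
Ablations show that combining SAM-based 2D mask refinement with multi-scale crops gives the best performance (AP 15.4, AP50 19.9, AP25 23.1).
OpenMask3D generalizes to novel and out-of-distribution categories, and outperforms OpenScene-based open-vocabulary methods in several settings.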
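For readers unfamiliar with the metrics, AP50 and AP25 follow the usual instance-segmentation convention: a predicted mask counts as a match when its intersection over union (IoU) with a ground-truth mask is at least 0.5 or 0.25, respectively. The sketch below illustrates only that matching test on toy data; it represents a 3D mask as a set of point indices, and the helper names `mask_iou` and `is_match` are hypothetical. It does not reproduce the full AP computation (confidence ranking and precision-recall integration are omitted).

```python
def mask_iou(pred_points, gt_points):
    """IoU of two instance masks given as collections of point indices."""
    pred, gt = set(pred_points), set(gt_points)
    union = pred | gt
    return len(pred & gt) / len(union) if union else 0.0


def is_match(pred_points, gt_points, iou_threshold):
    """True if the prediction overlaps the ground truth by at least the threshold."""
    return mask_iou(pred_points, gt_points) >= iou_threshold


# Toy masks: the prediction covers points 0..59, the ground truth points 30..99.
pred = range(0, 60)
gt = range(30, 100)

iou = mask_iou(pred, gt)                 # 30 shared points out of 100 in the union
match_at_25 = is_match(pred, gt, 0.25)   # counted as a hit for AP25
match_at_50 = is_match(pred, gt, 0.50)   # counted as a miss for AP50
```

This is why AP25 is never lower than AP50 in the table above: every mask that passes the stricter overlap test also passes the looser one.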


This review was generated by AI and checked by a human editor.