QUICK REVIEW

[Paper Review] PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning

Xiangyang Zhu, Renrui Zhang|arXiv (Cornell University)|Nov 21, 2022

Domain Adaptation and Few-Shot Learning|Cited by 22

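ONE-SENTENCE SUMMARY

PointCLIP V2 combines CLIP and GPT for zero-shot 3D classification, part segmentation, and detection without any 3D training, and extends to few-shot classification; realistic depth-map projection and 3D-aware GPT prompts bridge the gap between 2D images, 3D data, and language.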

ABSTRACT

Large-scale pre-trained models have shown promising open-world performance for both vision and language tasks. However, their transferred capacity on 3D point clouds is still limited and only constrained to the classification task. In this paper, we first collaborate CLIP and GPT to be a unified 3D open-world learner, named as PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection. To better align 3D data with the pre-trained language knowledge, PointCLIP V2 contains two key designs. For the visual end, we prompt CLIP via a shape projection module to generate more realistic depth maps, narrowing the domain gap between projected point clouds with natural images. For the textual end, we prompt the GPT model to generate 3D-specific text as the input of CLIP's textual encoder. Without any training in 3D domains, our approach significantly surpasses PointCLIP by +42.90%, +40.44%, and +28.75% accuracy on three datasets for zero-shot 3D classification. On top of that, V2 can be extended to few-shot 3D classification, zero-shot 3D part segmentation, and 3D object detection in a simple manner, demonstrating our generalization ability for unified 3D open-world learning.

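MOTIVATION AND GOALS

Advance 3D open-world understanding without training in 3D domains.
Bridge 2D vision-language models to 3D through realistic projection and GPT-generated text.
Enable zero-shot 3D classification, part segmentation, and detection, plus few-shot 3D classification.
Demonstrate generalization across multiple 3D tasks within one unified framework.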

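PROPOSED METHOD

Project 3D point clouds into depth maps with a Realistic Projection pipeline (Quantize, Densify, Smooth, Squeeze) to prompt CLIP.
Prompt GPT-3 with 3D-oriented commands to generate rich, 3D-specific text for CLIP's textual encoder.
Match multi-view depth maps against the GPT-generated 3D text to improve image-text alignment for 3D data.
Extend the framework to zero-shot and few-shot 3D classification, zero-shot 3D part segmentation, and zero-shot 3D object detection.
Optionally add learnable smoothing and 3D convolution modules for few-shot adaptation while keeping the CLIP encoders frozen.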
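
The four projection stages named above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the summary names the stages but not their operators, so the grid size, the "closeness" encoding of depth, the max pooling used for Densify, and the box filter used for Smooth are all assumptions chosen for illustration. The function name `realistic_projection` and the helper `_pool_hw` are hypothetical.

```python
import numpy as np

def _pool_hw(vol, k, reduce):
    """Apply a k x k sliding-window reduction over the H and W axes (zero padded, k odd)."""
    pad = k // 2
    h, w = vol.shape[:2]
    padded = np.pad(vol, ((pad, pad), (pad, pad), (0, 0)))
    windows = [padded[i:i + h, j:j + w] for i in range(k) for j in range(k)]
    return reduce(np.stack(windows), axis=0)

def realistic_projection(points, grid=(32, 32, 8), k=3):
    """Turn an (N, 3) point cloud into an (H, W) depth-like map (illustrative only)."""
    h, w, d = grid
    pts = np.asarray(points, dtype=float)

    # Normalize the cloud to the unit cube so it fits the grid.
    mins = pts.min(axis=0)
    span = np.maximum(pts.max(axis=0) - mins, 1e-8)
    unit = (pts - mins) / span

    # Quantize: drop each point into a voxel; the voxel stores a closeness
    # value in [0.5, 1] (assumed encoding: nearer along z means brighter).
    size = np.array([h, w, d])
    idx = np.minimum((unit * size).astype(int), size - 1)
    closeness = 1.0 - 0.5 * unit[:, 2]
    vox = np.zeros(grid)
    np.maximum.at(vox, (idx[:, 0], idx[:, 1], idx[:, 2]), closeness)

    # Densify: local max pooling fills holes between sparse points (assumed operator).
    vox = _pool_hw(vox, k, np.max)

    # Smooth: a box filter stands in for whatever smoothing kernel the paper uses.
    vox = _pool_hw(vox, k, np.mean)

    # Squeeze: collapse the depth axis to get a 2D map.
    return vox.max(axis=2)
```

Calling `realistic_projection(np.random.rand(1024, 3))` returns a 32 x 32 array that could be resized and replicated to three channels before being passed to an image encoder; that last step is omitted here.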

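EXPERIMENTAL RESULTS

Research Questions

RQ1: Can CLIP and GPT be prompted jointly to achieve unified 3D open-world understanding without training in 3D domains?
RQ2: How can realistic projection and 3D-aware text prompts turn 3D data into a form that CLIP handles well?
RQ3: How well does this unified framework perform on zero-shot and few-shot 3D classification, segmentation, and detection?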

Key Findings

Zero-shot 3D classification accuracy: 73.13% on ModelNet10, 64.22% on ModelNet40, and 35.36% on ScanObjectNN PB_T50_RS.
Improvement over PointCLIP in zero-shot classification accuracy: +42.90% on ModelNet10, +40.44% on ModelNet40, and +28.75% on ScanObjectNN.
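Zero-shot 3D detection on ScanNet V2 reaches 18.97% AP25 and 11.53% AP50.
Zero-shot 3D part segmentation on ShapeNetPart improves mean IoU (mIoU_I) by +17.4% over PointCLIP.
Few-shot results remain strong with minimal 3D training: for example, 16-shot accuracy on ModelNet40 reaches 89.55%, close to fully supervised baselines.
Ablation studies show that both Realistic Projection and 3D-aware GPT prompting are essential to the gains.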
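
The zero-shot classification figures above depend on matching depth-map features to text features. A minimal sketch of that scoring step follows, assuming CLIP-style cosine similarity between L2-normalized embeddings and a weighted sum over views. The function name `zero_shot_predict`, the uniform default view weights, and the temperature of 100 are illustrative assumptions, and the inputs stand in for features that real CLIP encoders would produce.

```python
import numpy as np

def zero_shot_predict(view_feats, text_feats, view_weights=None, temperature=100.0):
    """Score one object against K class prompts (illustrative only).

    view_feats: (V, C) features of V projected depth maps.
    text_feats: (K, C) features of K class descriptions.
    """
    # L2-normalize so the dot product is a cosine similarity.
    v = view_feats / np.linalg.norm(view_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = v @ t.T  # (V, K) similarity of every view to every class

    # Aggregate views; uniform weights are an assumption.
    if view_weights is None:
        view_weights = np.full(len(v), 1.0 / len(v))
    logits = temperature * (view_weights @ sims)  # (K,)

    # Numerically stable softmax over classes.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs
```

With GPT-generated descriptions, `text_feats` would hold one embedding per class (for instance, an average over several generated sentences), which is a design choice this sketch leaves to the caller.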


This review was generated by AI and reviewed by a human editor.