QUICK REVIEW

[Paper Review] Point-E: A System for Generating 3D Point Clouds from Complex Prompts

Alex Nichol, Heewoo Jun|arXiv (Cornell University)|Dec 16, 2022

3D Shape Modeling and Analysis|Cited by 157

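One-Sentence Summary

Point-E uses a text-to-image diffusion model to render a single synthetic view from a prompt, then uses a second diffusion model conditioned on that view to generate a colored 3D point cloud, so that sampling takes only 1–2 minutes on a single GPU.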

ABSTRACT

While recent work on text-conditional 3D object generation has shown promising results, the state-of-the-art methods typically require multiple GPU-hours to produce a single sample. This is in stark contrast to state-of-the-art generative image models, which produce samples in a number of seconds or minutes. In this paper, we explore an alternative method for 3D object generation which produces 3D models in only 1-2 minutes on a single GPU. Our method first generates a single synthetic view using a text-to-image diffusion model, and then produces a 3D point cloud using a second diffusion model which conditions on the generated image. While our method still falls short of the state-of-the-art in terms of sample quality, it is one to two orders of magnitude faster to sample from, offering a practical trade-off for some use cases. We release our pre-trained point cloud diffusion models, as well as evaluation code and models, at https://github.com/openai/point-e.

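Motivation and Goals

Democratize 3D content creation by reducing the sampling time of text-conditional 3D generation.
Handle complex prompts with a two-stage diffusion approach (text-to-image, then image-conditional 3D).
Generate colored 3D point clouds and release evaluation tools and pre-trained models.
Bring diffusion-based 3D generation down to practical runtimes on a single GPU.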

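Proposed Method

Generate a synthetic rendered view from the text prompt using a GLIDE model fine-tuned on 3D renderings.
Generate a low-resolution colored point cloud (1,024 points) with a transformer-based diffusion model conditioned on the synthetic view.
Upsample to a finer colored point cloud (4,096 points), conditioned on the low-resolution cloud and the synthetic view.
Train on a dataset of several million 3D models, each rendered into views and converted into a 4K-point colored point cloud.
Condition the point cloud diffusion models on CLIP image features of the synthetic view rather than on the raw text alone.
For evaluation, convert point clouds into meshes using a regression-based SDF predictor and marching cubes.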
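
The data flow through the three stages listed above can be sketched as follows. This is only an illustration of the shapes involved, not the released implementation: every function name here is hypothetical, the diffusion models are replaced by random placeholders, the 64x64 image size is an assumption, and the choice to keep the coarse points and append new ones in the upsampling stage is also an assumption. Only the 1,024 and 4,096 point counts come from the summary.

```python
import random

LOW_RES_POINTS = 1024   # coarse point cloud size stated in the summary
HIGH_RES_POINTS = 4096  # upsampled point cloud size stated in the summary


def render_synthetic_view(prompt, size=64):
    """Placeholder for stage 1 (text -> image): a size x size grid of RGB tuples.

    The 64x64 resolution is an assumption made for this sketch.
    """
    rng = random.Random(prompt)  # seed on the prompt so the stub is repeatable
    return [[(rng.random(), rng.random(), rng.random()) for _ in range(size)]
            for _ in range(size)]


def sample_points(rng, count):
    """Placeholder sampler: `count` random points, each (x, y, z, r, g, b)."""
    return [tuple(rng.random() for _ in range(6)) for _ in range(count)]


def generate_base_cloud(image):
    """Placeholder for stage 2 (image -> coarse colored point cloud)."""
    rng = random.Random(len(image))
    return sample_points(rng, LOW_RES_POINTS)


def upsample_cloud(image, base_cloud):
    """Placeholder for stage 3: conditioned on the image and the coarse cloud.

    Assumption: the coarse points are kept and new points are appended.
    """
    rng = random.Random(len(base_cloud))
    extra = sample_points(rng, HIGH_RES_POINTS - len(base_cloud))
    return base_cloud + extra


def text_to_point_cloud(prompt):
    """Chain the three stages: text -> view -> coarse cloud -> fine cloud."""
    image = render_synthetic_view(prompt)
    base = generate_base_cloud(image)
    return upsample_cloud(image, base)


cloud = text_to_point_cloud("a red motorcycle")
print(len(cloud), len(cloud[0]))
```

Each point carries six values (position plus color), which is what makes the output a colored point cloud rather than bare geometry.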

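Experimental Results

Research Questions

RQ1: Can a two-stage diffusion pipeline (text-to-image, followed by image-conditional 3D diffusion) generate coherent colored 3D point clouds from open-ended prompts?
RQ2: How does sampling speed trade off against final 3D quality relative to existing methods?
RQ3: How does conditioning on a richer image representation (a grid of CLIP latent vectors) affect the fidelity and diversity of 3D generation?
RQ4: What are the limitations and failure modes of image-conditional 3D diffusion on complex prompts?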

Key Findings

Method	CLIP R-Precision (ViT-B/32)	CLIP R-Precision (ViT-L/14)	Latency
DreamFields	78.6%	82.9%	~200 V100-hr
CLIP-Mesh	67.8%	74.5%	~17 V100-min
DreamFusion	75.1%	79.7%	~12 V100-hr
Point⋅E (40M, text-only)	15.4%	16.2%	16 V100-sec
Point⋅E (40M)	36.5%	38.8%	1.0 V100-min
Point⋅E (300M)	40.3%	45.6%	1.2 V100-min
Point⋅E (1B)	41.1%	46.8%	1.5 V100-min
Conditioning images	69.6%	86.6%	-
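
The ViT-B/32 and ViT-L/14 columns report CLIP R-Precision, a retrieval-style score: a rendering counts as a hit when the prompt it was generated from scores highest among all candidate prompts. The sketch below computes that top-1 rate from a precomputed similarity matrix. The function name and the toy scores are made up for illustration, and the step that would produce the scores (embedding renderings and prompts with CLIP) is omitted.

```python
def r_precision_at_1(similarity):
    """Fraction of renderings whose own prompt is the top-scoring candidate.

    similarity[i][j] is the score between rendering i and prompt j,
    and prompt i is the one rendering i was generated from.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=lambda j: row[j])
        if best == i:
            hits += 1
    return hits / len(similarity)


# Toy scores for three renderings and three prompts (illustrative values only).
scores = [
    [0.31, 0.12, 0.08],  # rendering 0 retrieves prompt 0: hit
    [0.10, 0.22, 0.27],  # rendering 1 retrieves prompt 2: miss
    [0.05, 0.09, 0.30],  # rendering 2 retrieves prompt 2: hit
]
precision = r_precision_at_1(scores)
print(f"{precision:.1%}")
```

Under this reading, the "Conditioning images" row is a rough ceiling for the image-conditional models, since it scores the synthetic views themselves rather than the 3D output derived from them.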

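Point⋅E can generate diverse and complex colored 3D point clouds from text prompts.
Increasing model size and using richer image conditioning improve CLIP R-Precision as well as the P-FID and P-IS metrics.
Point⋅E samples substantially faster on a single GPU (1–2 minutes), at the cost of lower peak quality than some prior methods.
Qualitative results show failure modes, for example when the point cloud model misreads the object's shape in the conditioning image or its occluded parts.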
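
The speed side of this trade-off can be made concrete with the latency column of the table above. The short calculation below converts each latency to V100-seconds and divides by the Point⋅E (1B) figure. The table values are approximate (marked with "~"), so the ratios are rough estimates rather than measured speedups.

```python
SECONDS_PER_UNIT = {"sec": 1, "min": 60, "hr": 3600}

# Latencies copied from the table, as (value, unit) in V100 time.
latencies = {
    "DreamFields": (200, "hr"),
    "CLIP-Mesh": (17, "min"),
    "DreamFusion": (12, "hr"),
    "Point-E (1B)": (1.5, "min"),
}


def to_seconds(value, unit):
    """Convert a (value, unit) latency into V100-seconds."""
    return value * SECONDS_PER_UNIT[unit]


baseline = to_seconds(*latencies["Point-E (1B)"])
speedups = {name: to_seconds(*lat) / baseline for name, lat in latencies.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x the sampling time of Point-E (1B)")
```

By these figures, the largest Point⋅E model is about 11 times faster than CLIP-Mesh and several hundred times faster than DreamFusion, while trailing all three baselines on CLIP R-Precision.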


This review was generated by AI and reviewed by a human editor.