QUICK REVIEW

[Paper Review] DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation

Shentong Mo, Enze Xie|arXiv (Cornell University)|Jul 4, 2023

3D Shape Modeling and Analysis | Cited by 12

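One-Sentence Summary

DiT-3D introduces a plain diffusion Transformer for 3D point cloud generation that denoises voxelized point clouds. It achieves state-of-the-art results on ShapeNet and stays efficient through parameter-efficient fine-tuning from a 2D pre-trained checkpoint (2D→3D) and 3D window attention.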

ABSTRACT

Recent Diffusion Transformers (e.g., DiT) have demonstrated their powerful effectiveness in generating high-quality 2D images. However, it is still being determined whether the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape generation, namely DiT-3D, which can directly operate the denoising process on voxelized point clouds using plain Transformers. Compared to existing U-Net approaches, our DiT-3D is more scalable in model size and produces much higher quality generations. Specifically, the DiT-3D adopts the design philosophy of DiT but modifies it by incorporating 3D positional and patch embeddings to adaptively aggregate input from voxelized point clouds. To reduce the computational cost of self-attention in 3D shape generation, we incorporate 3D window attention into Transformer blocks, as the increased 3D token length resulting from the additional dimension of voxels can lead to high computation. Finally, linear and devoxelization layers are used to predict the denoised point clouds. In addition, our transformer architecture supports efficient fine-tuning from 2D to 3D, where the pre-trained DiT-2D checkpoint on ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on the ShapeNet dataset demonstrate that the proposed DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation. In particular, our DiT-3D decreases the 1-Nearest Neighbor Accuracy of the state-of-the-art method by 4.59 and increases the Coverage metric by 3.51 when evaluated on Chamfer Distance.

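Motivation and Goals

Investigate whether a plain diffusion Transformer can match U-Net-based 3D generation methods in producing high-fidelity 3D point clouds.
Develop a diffusion Transformer that operates directly on voxelized point clouds for denoising-based 3D generation.
Incorporate 3D-specific adaptations (3D positional and patch embeddings, 3D window attention) to manage the growth in the number of 3D tokens.
Demonstrate parameter-efficient fine-tuning from 2D ImageNet pre-training for modality transfer (2D→3D) and domain transfer (class to class).
Show through ablation studies on ShapeNet how the approach scales with patch size, voxel size, and model size.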
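
To make the token-growth concern concrete, here is a minimal sketch of the arithmetic. It assumes a cubic V×V×V voxel grid cut into non-overlapping p×p×p patches, which gives L = (V/p)^3 tokens, and it compares the O(L^2) and O(L^2/R^3) attention costs reported for DiT-3D. The function names and the sample values of V, p, and R are illustrative assumptions, not settings confirmed by this review.

```python
def token_count(voxel_size: int, patch_size: int) -> int:
    """Tokens L produced by cutting a V^3 grid into non-overlapping p^3 patches."""
    if voxel_size % patch_size != 0:
        raise ValueError("voxel_size must be divisible by patch_size")
    return (voxel_size // patch_size) ** 3


def attention_cost(num_tokens: int, window_size: int = 1) -> float:
    """Relative self-attention cost: L^2 globally, L^2 / R^3 with window size R."""
    return num_tokens ** 2 / window_size ** 3


# Hypothetical (V, p) settings chosen only to show the cubic growth.
for v, p in [(16, 4), (32, 4), (32, 2), (64, 4)]:
    n = token_count(v, p)
    print(
        f"V={v:2d} p={p}  L={n:5d}  "
        f"global={attention_cost(n):.3e}  "
        f"windowed(R=4)={attention_cost(n, window_size=4):.3e}"
    )
```

Halving the patch size or doubling the voxel size multiplies the token count by 8 and the global attention cost by 64, which is why a 3D model needs some form of attention reduction sooner than a 2D one.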

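Proposed Method

Replace the U-Net with a plain diffusion Transformer for 3D shape generation on voxelized point clouds.
Form tokens by voxelizing the point cloud and applying a 3D patch embedding plus 3D sine-cosine positional embeddings.
Apply 3D window attention, which reduces self-attention complexity from O(L^2) to O(L^2/R^3), where L is the number of tokens and R is the window size.
Devoxelize the Transformer output to predict the denoised point cloud in the original point space.
Initialize from DiT weights pre-trained on 2D ImageNet and fine-tune parameter-efficiently (DiffFit) for modality transfer; the same scheme is used for domain transfer (class to class).
Train with the DDPM objective (a simple loss on the predicted noise) and support multi-class conditioning through learnable class embeddings.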
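
The data flow described above can be sketched roughly as follows: noise the points, voxelize them, cut the grid into patch tokens, and score a noise prediction with the simple DDPM loss. This is a simplified illustration under several assumptions that are not taken from the source: an occupancy grid stands in for voxelized point features, the linear beta schedule is the common DDPM default, and `placeholder_denoiser` is a hypothetical stand-in that replaces the Transformer and devoxelization with a zero prediction.

```python
import numpy as np


def voxelize(points: np.ndarray, voxel_size: int) -> np.ndarray:
    """Map an (N, 3) point cloud in roughly [-1, 1]^3 to a (V, V, V) occupancy grid."""
    idx = np.floor((points + 1.0) / 2.0 * voxel_size).astype(int)
    idx = np.clip(idx, 0, voxel_size - 1)  # noisy points may leave the cube
    grid = np.zeros((voxel_size,) * 3)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid


def patchify(grid: np.ndarray, patch_size: int) -> np.ndarray:
    """Cut a (V, V, V) grid into non-overlapping p^3 patches -> (L, p^3) tokens."""
    n = grid.shape[0] // patch_size
    x = grid.reshape(n, patch_size, n, patch_size, n, patch_size)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # patch-index axes first, within-patch axes last
    return x.reshape(n ** 3, patch_size ** 3)


def placeholder_denoiser(x_t: np.ndarray, t: int) -> np.ndarray:
    """Hypothetical stand-in for the Transformer: builds tokens, predicts zero noise."""
    tokens = patchify(voxelize(x_t, voxel_size=32), patch_size=4)  # shape (512, 64)
    _ = tokens  # a real model would attend over these tokens and devoxelize
    return np.zeros_like(x_t)


def ddpm_noise_loss(x0, t, alpha_bar, predict_noise, rng) -> float:
    """Simple DDPM loss: MSE between the sampled noise and the predicted noise."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps  # forward noising q(x_t | x_0)
    return float(np.mean((eps - predict_noise(x_t, t)) ** 2))


rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(2048, 3))  # toy point cloud, not ShapeNet data
betas = np.linspace(1e-4, 0.02, 1000)            # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
loss = ddpm_noise_loss(points, 500, alpha_bar, placeholder_denoiser, rng)
print(f"loss with a zero-noise predictor: {loss:.3f}")
```

With a predictor that always outputs zero, the loss is just the mean squared noise, so it sits near 1; a trained denoiser would drive it lower.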

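Experimental Results

Research Questions

RQ1: Can a plain diffusion Transformer operate effectively on voxelized 3D point clouds to generate high-fidelity shapes?
RQ2: Which 3D-specific adaptations (positional/patch embeddings, window attention) are essential for a 3D diffusion Transformer to perform well?
RQ3: Does 2D ImageNet pre-training bring transferable benefits to 3D generation, and can parameter-efficient fine-tuning achieve cross-modality transfer?
RQ4: How well does the DiT-3D architecture scale across voxel sizes, patch sizes, and model sizes while preserving quality and diversity?
RQ5: How do the 3D design components (voxel diffusion, 3D embeddings, window attention) affect generation efficiency and metrics?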

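Key Findings

DiT-3D achieves state-of-the-art 3D point cloud generation on ShapeNet compared with prior non-DDPM and DDPM baselines.
In ablations, voxel diffusion, 3D positional embeddings, and 3D window attention together reduce training cost and improve the 1-NNA and COV metrics.
2D ImageNet pre-training with DiffFit-style fine-tuning yields measurable gains over training from scratch, while sharply reducing the number of parameters that must be tuned for modality transfer.
Domain-transfer experiments show that a model trained on one category (e.g., chair) reaches competitive quality and diversity on other categories after fine-tuning only 0.09 MB of parameters.
The method scales with patch size, voxel size, and model size; in the authors' study, a smaller patch size (e.g., 2) and a larger voxel size give better results.
DiT-3D supports efficient fine-tuning and cross-domain/cross-modality transfer, and outperforms MeshDiffusion and LION on several metrics for Chair, Airplane, and Car.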
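
For readers unfamiliar with the metrics named above, the sketch below shows one common way to compute Chamfer Distance (CD), Coverage (COV), and 1-Nearest Neighbor Accuracy (1-NNA) between a generated set and a reference set. It is an illustration only: CD conventions vary (squared versus unsquared distances, sum versus mean), the toy data is random, and the function names are assumptions rather than the evaluation code used in the paper.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3)."""
    d = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M) squared distances
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())


def pairwise_cd(xs, ys) -> np.ndarray:
    """Matrix of Chamfer Distances between every shape in xs and every shape in ys."""
    return np.array([[chamfer_distance(x, y) for y in ys] for x in xs])


def coverage(generated, reference) -> float:
    """COV: fraction of reference shapes that are the nearest match of some generated shape."""
    d = pairwise_cd(generated, reference)  # (G, R)
    return len(np.unique(d.argmin(axis=1))) / len(reference)


def one_nna(generated, reference) -> float:
    """1-NNA: leave-one-out 1-NN accuracy at telling generated from reference shapes.

    Values near 0.5 mean the two sets are hard to distinguish; 1.0 means fully separable.
    """
    shapes = list(generated) + list(reference)
    labels = np.array([0] * len(generated) + [1] * len(reference))
    d = pairwise_cd(shapes, shapes)
    np.fill_diagonal(d, np.inf)  # a shape may not pick itself as its neighbor
    return float(np.mean(labels[d.argmin(axis=1)] == labels))


rng = np.random.default_rng(0)
reference = [rng.uniform(-1, 1, size=(64, 3)) for _ in range(8)]
similar = [rng.uniform(-1, 1, size=(64, 3)) for _ in range(8)]  # same distribution
shifted = [s + 10.0 for s in similar]                           # clearly different

print("similar set: COV =", coverage(similar, reference), " 1-NNA =", one_nna(similar, reference))
print("shifted set: COV =", coverage(shifted, reference), " 1-NNA =", one_nna(shifted, reference))
```

Higher COV indicates better diversity, while 1-NNA is best when it approaches 50%, which is why the abstract reports a decrease in 1-NNA as an improvement.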


This review was generated by AI and checked by a human editor.