QUICK REVIEW

[Paper Review] DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model

Yinghao Xu, Hao Tan|arXiv (Cornell University)|Nov 15, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 18

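One-line summary

DMV3D is a single-stage, category-agnostic diffusion model whose denoiser is a large transformer-based 3D reconstruction model that outputs a triplane NeRF. This enables fast (about 30 seconds) text- or image-conditioned 3D generation without 3D supervision.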

ABSTRACT

We propose DMV3D, a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion. Our reconstruction model incorporates a triplane NeRF representation and can denoise noisy multi-view images via NeRF reconstruction and rendering, achieving single-stage 3D generation in ~30s on single A100 GPU. We train DMV3D on large-scale multi-view image datasets of highly diverse objects using only image reconstruction losses, without accessing 3D assets. We demonstrate state-of-the-art results for the single-image reconstruction problem where probabilistic modeling of unseen object parts is required for generating diverse reconstructions with sharp textures. We also show high-quality text-to-3D generation results outperforming previous 3D diffusion models. Our project website is at: https://justimyhxu.github.io/projects/dmv3d/ .

Motivation and objectives

Aim to achieve fast, realistic, and generic 3D generation without per-asset optimization or 3D supervision.
Leverage a 3D large reconstruction model as a multi-view denoiser within a diffusion framework.
Train on large-scale multi-view image datasets using only image-space supervision.
Produce diverse high-fidelity 3D assets from text or a single image.
Demonstrate state-of-the-art 3D reconstruction quality and competitive text-to-3D results.

Proposed method

Use a 2D multi-view diffusion model whose denoiser is a 3D reconstruction module that outputs a clean triplane NeRF from noisy multi-view inputs.
Represent the 3D scene as a triplane NeRF and render with differentiable volume rendering to supervise reconstruction (via novel-view renderings).
Condition the denoiser on the diffusion time step and on camera rays encoded as Plücker coordinates, so that it can handle varying noise levels and diverse viewpoints.
Extend to image conditioning (using one clean view as input and denoising others) and text conditioning (via CLIP embeddings and cross-attention) for controllable 3D generation.
Train with a reconstruction loss (L_recon in the paper) computed on renderings at both input and novel viewpoints to enforce 3D consistency.
Base the denoiser on the large reconstruction model (LRM) architecture, a transformer with triplane-to-image cross-attention and triplane-to-triplane self-attention.
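
The camera-ray conditioning mentioned above can be illustrated with a small NumPy sketch. It is not the authors' code: the function name `plucker_rays`, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the +z-forward camera convention, and the (o x d, d) ordering of the six Plücker components are all assumptions made for illustration. The idea it shows is that every pixel gets a 6-D ray code that depends on the camera pose but not on which point along the ray is used as its origin.

```python
import numpy as np

def plucker_rays(cam_to_world, fx, fy, cx, cy, height, width):
    """Return an (H, W, 6) array of Plücker ray coordinates (o x d, d).

    cam_to_world is a 4x4 camera-to-world matrix; the pinhole model and the
    +z-forward camera frame are assumptions of this sketch.
    """
    # Pixel-center coordinates, each of shape (H, W).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    # Ray directions in the camera frame.
    dirs_cam = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
    # Rotate directions into the world frame and normalize to unit length.
    dirs = dirs_cam @ cam_to_world[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # All rays of one camera share the same origin (the camera center).
    origin = np.broadcast_to(cam_to_world[:3, 3], dirs.shape)
    # The moment o x d is unchanged if the origin slides along the ray.
    moment = np.cross(origin, dirs)
    return np.concatenate([moment, dirs], axis=-1)

# Usage: a 4x4-pixel camera placed two units behind the world origin.
pose = np.eye(4)
pose[:3, 3] = [0.0, 0.0, -2.0]
rays = plucker_rays(pose, fx=100.0, fy=100.0, cx=2.0, cy=2.0, height=4, width=4)
```

A per-pixel ray map like this could be concatenated with the noisy image channels before tokenization; how DMV3D injects it exactly is not specified in this review.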

Experimental results

Research questions

RQ1: Can a single-stage diffusion model conditioned on text or a single image generate diverse, high-fidelity 3D assets without 3D supervision?
RQ2: Does a large transformer-based 3D reconstruction denoiser enable robust multi-view denoising and stable 3D reconstruction across diverse object categories?
RQ3: How do time conditioning and Plücker-ray camera conditioning affect diffusion-based 3D generation quality and stability?
RQ4: What is the impact of multi-view input count on 3D reconstruction quality and stability in a diffusion-based pipeline?
RQ5: How does DMV3D perform on single-image reconstruction and text-to-3D tasks compared to prior 3D diffusion approaches?

Key findings

Setting (#views or ablation)	FID ↓	CLIP ↑	PSNR ↑	SSIM ↑	LPIPS ↓	CD ↓
4 views (Ours)	35.16	0.888	21.798	0.852	0.150	0.0459
1 view	70.59	0.788	17.560	0.832	0.304	0.0775
2 views	47.69	0.896	20.965	0.851	0.167	0.0544
6 views	39.11	0.899	21.545	0.861	0.148	0.0454
w/o novel-view supervision	102.00	0.801	17.772	0.838	0.289	0.185
w/o Plücker rays	43.31	0.883	20.930	0.842	0.185	0.0505

Achieves fast 3D generation (~30s on a single A100 GPU) by integrating NeRF reconstruction into a 2D diffusion denoiser.
Outperforms prior 3D diffusion models on single-image reconstruction and text-to-3D benchmarks.
Demonstrates state-of-the-art results for single-image 3D reconstruction on ABO and GSO datasets (quantitative improvements across multiple metrics).
Produces diverse high-fidelity 3D assets from the same input image due to the probabilistic diffusion process.
Enables controllable 3D generation conditioned on text or images, with competitive quality.
Shows robustness to out-of-domain inputs, aided by training on MVImgNet and Objaverse data and by camera conditioning via Plücker rays.
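
Several of the findings above rest on the denoiser returning clean images directly (an x0-prediction) instead of a noise estimate. The sketch below shows how such a denoiser could be dropped into a sampling loop. It is a minimal illustration under stated assumptions: it uses a deterministic DDIM-style update, a hypothetical `denoiser(x, t)` callable standing in for "reconstruct a triplane NeRF from the noisy views and render it at the input viewpoints", and a hypothetical `alpha_bars` array of cumulative noise-schedule products. The sampler and schedule actually used by DMV3D are not described in this review.

```python
import numpy as np

def ddim_step(x_t, x0_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM-style update driven by an x0-prediction."""
    # Noise implied by the current sample and the predicted clean images.
    eps = (x_t - np.sqrt(alpha_bar_t) * x0_pred) / np.sqrt(1.0 - alpha_bar_t)
    # Re-noise the prediction to the previous (lower) noise level.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps

def sample(denoiser, shape, alpha_bars, rng):
    """Run the reverse process from pure noise down to clean multi-view images.

    denoiser(x, t) is a placeholder for a 3D-reconstruction-based denoiser;
    alpha_bars[t] must lie strictly between 0 and 1 and decrease with t.
    """
    x = rng.standard_normal(shape)  # e.g. (views, H, W, 3)
    for t in range(len(alpha_bars) - 1, -1, -1):
        x0_pred = denoiser(x, t)
        # At the last step there is no noise left, so alpha_bar_prev is 1.
        alpha_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
        x = ddim_step(x, x0_pred, alpha_bars[t], alpha_bar_prev)
    return x
```

Because the final step uses a noise level of zero, the returned images equal the denoiser's last prediction. In a pipeline like the one described here, the 3D asset would be the NeRF behind that last prediction, not the images themselves.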

This review was generated by AI and checked by a human editor.