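[Paper Review] DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model
DMV3D is a single-stage, category-agnostic diffusion model whose denoiser is a large transformer-based 3D reconstruction model that outputs a triplane NeRF. This enables fast (about 30 seconds) text- or image-conditioned 3D generation without 3D supervision.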
We propose DMV3D, a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion. Our reconstruction model incorporates a triplane NeRF representation and can denoise noisy multi-view images via NeRF reconstruction and rendering, achieving single-stage 3D generation in ~30s on a single A100 GPU. We train DMV3D on large-scale multi-view image datasets of highly diverse objects using only image reconstruction losses, without accessing 3D assets. We demonstrate state-of-the-art results for the single-image reconstruction problem where probabilistic modeling of unseen object parts is required for generating diverse reconstructions with sharp textures. We also show high-quality text-to-3D generation results outperforming previous 3D diffusion models. Our project website is at: https://justimyhxu.github.io/projects/dmv3d/ .
Motivation and Objectives
- Aim to achieve fast, realistic, and generic 3D generation without per-asset optimization or 3D supervision.
- Leverage a 3D large reconstruction model as a multi-view denoiser within a diffusion framework.
- Train on large-scale multi-view image datasets using only image-space supervision.
- Produce diverse high-fidelity 3D assets from text or a single image.
- Demonstrate state-of-the-art 3D reconstruction quality and competitive text-to-3D results.
Proposed Method
- Use a 2D multi-view diffusion model whose denoiser is a 3D reconstruction module that outputs a clean triplane NeRF from noisy multi-view inputs.
- Represent the 3D scene as a triplane NeRF and render with differentiable volume rendering to supervise reconstruction (via novel-view renderings).
- Condition the denoiser on the diffusion time step (time conditioning) and on camera rays encoded as Plücker coordinates, so it can handle varying noise levels and diverse viewpoints.
- Extend to image conditioning (using one clean view as input and denoising others) and text conditioning (via CLIP embeddings and cross-attention) for controllable 3D generation.
- Train with a reconstruction loss computed on renderings at both the input views and novel views to enforce 3D consistency (the paper's L_recon).
- Base the denoiser on the large reconstruction model (LRM) architecture with transformer-based triplane-to-image and triplane-to-triplane attention.
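The Plücker-ray camera conditioning listed above can be made concrete with a short sketch. For a ray with origin o and unit direction d, the 6-D Plücker coordinates are (o × d, d), which gives every pixel a camera-aware feature. The NumPy code below is an illustration under stated assumptions, not DMV3D's implementation: the helper names `pixel_rays` and `plucker_rays`, the pinhole camera model, and the OpenCV-style axis convention (x right, y down, z forward) are hypothetical choices made for this example.

```python
import numpy as np

def plucker_rays(origins, directions):
    """Return 6-D Plücker coordinates (o x d, d) for each ray."""
    origins = np.asarray(origins, dtype=np.float64)
    directions = np.asarray(directions, dtype=np.float64)
    # Normalize directions so the encoding does not depend on their length.
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    moment = np.cross(origins, d)  # o x d
    return np.concatenate([moment, d], axis=-1)

def pixel_rays(c2w, focal, height, width):
    """Per-pixel ray origins and directions for a pinhole camera.

    Hypothetical helper: c2w is a 4x4 camera-to-world matrix and the
    principal point is assumed to sit at the image center.
    """
    # Pixel centers on an (height, width) grid.
    j, i = np.meshgrid(np.arange(height) + 0.5,
                       np.arange(width) + 0.5, indexing="ij")
    dirs_cam = np.stack([(i - width / 2) / focal,
                         (j - height / 2) / focal,
                         np.ones_like(i)], axis=-1)
    dirs_world = dirs_cam @ c2w[:3, :3].T  # rotate into world space
    origins = np.broadcast_to(c2w[:3, 3], dirs_world.shape)
    return origins, dirs_world

# Example: a camera placed 2 units behind the origin, looking along +z.
c2w = np.eye(4)
c2w[:3, 3] = [0.0, 0.0, -2.0]
o, d = pixel_rays(c2w, focal=64.0, height=4, width=4)
rays = plucker_rays(o, d)
print(rays.shape)  # (4, 4, 6)
```

A useful property of this encoding is that it identifies the ray itself rather than a point on it: sliding the origin along the ray leaves (o × d, d) unchanged, since (o + t·d) × d = o × d.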
Experimental Results
Research Questions
- RQ1: Can a single-stage diffusion model conditioned on text or a single image generate diverse, high-fidelity 3D assets without 3D supervision?
- RQ2: Does a large transformer-based 3D reconstruction denoiser enable robust multi-view denoising and stable 3D reconstruction across diverse object categories?
- RQ3: How do time conditioning and Plücker-ray camera conditioning affect diffusion-based 3D generation quality and stability?
- RQ4: What is the impact of multi-view input count on 3D reconstruction quality and stability in a diffusion-based pipeline?
- RQ5: How does DMV3D perform on single-image reconstruction and text-to-3D tasks compared to prior 3D diffusion approaches?
Key Findings
| Setting (#views or ablation) | FID ↓ | CLIP ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | CD ↓ |
|---|---|---|---|---|---|---|
| 1 | 70.59 | 0.788 | 17.560 | 0.832 | 0.304 | 0.0775 |
| 2 | 47.69 | 0.896 | 20.965 | 0.851 | 0.167 | 0.0544 |
| 4 (Ours) | 35.16 | 0.888 | 21.798 | 0.852 | 0.150 | 0.0459 |
| 6 | 39.11 | 0.899 | 21.545 | 0.861 | 0.148 | 0.0454 |
| w/o Novel | 102.00 | 0.801 | 17.772 | 0.838 | 0.289 | 0.185 |
| w/o Plücker | 43.31 | 0.883 | 20.930 | 0.842 | 0.185 | 0.0505 |
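For readers less familiar with the PSNR column in the table above, the following sketch shows the standard definition, PSNR = 10·log10(MAX² / MSE). It is a generic illustration, not the paper's evaluation code: the function names, the assumed pixel range of [0, 1], and the image sizes are assumptions, and the review does not state the exact evaluation protocol.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def mse_ratio_from_psnr_gap(gap_db):
    """MSE ratio implied by a PSNR gap measured on the same image pair."""
    return 10.0 ** (gap_db / 10.0)

# Example: a uniform error of 0.1 on a [0, 1] image gives MSE = 0.01.
target = np.zeros((8, 8, 3))
pred = np.full((8, 8, 3), 0.1)
print(round(psnr(pred, target), 6))  # 20.0
```

Because PSNR is logarithmic, a gap in dB corresponds to a multiplicative change in MSE. As a rough reading only (PSNR values averaged over a test set do not map exactly to an average MSE), the 4-view versus 1-view gap of 21.798 − 17.560 = 4.238 dB corresponds to an MSE ratio of about 2.65.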
- Achieves fast 3D generation (~30s on a single A100 GPU) by integrating NeRF reconstruction into a 2D diffusion denoiser.
- Outperforms prior 3D diffusion models on single-image reconstruction and text-to-3D benchmarks.
- Demonstrates state-of-the-art results for single-image 3D reconstruction on ABO and GSO datasets (quantitative improvements across multiple metrics).
- Produces diverse high-fidelity 3D assets from the same input image due to the probabilistic diffusion process.
- Enables controllable 3D generation conditioning on text and images with competitive quality.
- Demonstrates robustness to out-of-domain inputs, attributed to training on MVImgNet and Objaverse data and to camera conditioning via Plücker rays.
This review was generated by AI and checked by a human editor.