QUICK REVIEW

[Paper Review] DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model

Yinghao Xu, Hao Tan|arXiv (Cornell University)|Nov 15, 2023
Generative Adversarial Networks and Image Synthesis|Cited by 18
One-Sentence Summary

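DMV3D is a single-stage, category-agnostic diffusion model whose denoiser is a large transformer-based 3D reconstruction model that outputs a triplane NeRF. This enables fast (about 30 seconds) text- or image-conditioned 3D generation without 3D supervision.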

ABSTRACT

We propose DMV3D, a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion. Our reconstruction model incorporates a triplane NeRF representation and can denoise noisy multi-view images via NeRF reconstruction and rendering, achieving single-stage 3D generation in ~30s on single A100 GPU. We train DMV3D on large-scale multi-view image datasets of highly diverse objects using only image reconstruction losses, without accessing 3D assets. We demonstrate state-of-the-art results for the single-image reconstruction problem where probabilistic modeling of unseen object parts is required for generating diverse reconstructions with sharp textures. We also show high-quality text-to-3D generation results outperforming previous 3D diffusion models. Our project website is at: https://justimyhxu.github.io/projects/dmv3d/ .

Motivation and Objectives

  • Aim to achieve fast, realistic, and generic 3D generation without per-asset optimization or 3D supervision.
  • Leverage a 3D large reconstruction model as a multi-view denoiser within a diffusion framework.
  • Train on large-scale multi-view image datasets using only image-space supervision.
  • Produce diverse high-fidelity 3D assets from text or a single image.
  • Demonstrate state-of-the-art 3D reconstruction quality and competitive text-to-3D results.

Proposed Method

  • Use a 2D multi-view diffusion model whose denoiser is a 3D reconstruction module that outputs a clean triplane NeRF from noisy multi-view inputs.
  • Represent the 3D scene as a triplane NeRF and render with differentiable volume rendering to supervise reconstruction (via novel-view renderings).
  • Condition the denoiser on the diffusion time step (time conditioning) and on camera rays encoded as Plücker coordinates, so that it can handle varying noise levels and diverse viewpoints.
  • Extend to image conditioning (using one clean view as input and denoising others) and text conditioning (via CLIP embeddings and cross-attention) for controllable 3D generation.
  • Train with a reconstruction loss (L_recon) computed on renderings at both the input views and novel views to enforce 3D consistency.
  • Base the denoiser on the large reconstruction model (LRM) architecture, a transformer with triplane-to-image cross-attention and triplane-to-triplane self-attention.
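
Because the denoiser described above returns clean images (rendered from the reconstructed triplane NeRF) rather than a noise estimate, each sampling step has to turn that clean prediction into the next, less noisy sample. The sketch below shows one way to do this, assuming a deterministic DDIM-style update; the review does not state which sampler is used, and the function name, the flat-list image format, and the `alpha_bar` arguments are illustrative assumptions, not the authors' code.

```python
import math


def ddim_step_from_x0(x_t, x0_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM-style update (eta = 0) from a predicted clean sample.

    x_t, x0_pred: flat lists of pixel values (noisy input, denoiser output).
    alpha_bar_t: cumulative signal level of the current step, in (0, 1).
    alpha_bar_prev: cumulative signal level of the next, less noisy step, in (0, 1].
    """
    sqrt_ab_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_ab_t = math.sqrt(1.0 - alpha_bar_t)
    sqrt_ab_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_ab_prev = math.sqrt(1.0 - alpha_bar_prev)

    x_prev = []
    for xt, x0 in zip(x_t, x0_pred):
        # Noise implied by the current sample and the predicted clean sample.
        eps = (xt - sqrt_ab_t * x0) / sqrt_one_minus_ab_t
        # Re-noise the clean prediction to the next, lower noise level.
        x_prev.append(sqrt_ab_prev * x0 + sqrt_one_minus_ab_prev * eps)
    return x_prev
```

In a multi-view setting, `x0_pred` would hold the renderings of the reconstructed NeRF at the input viewpoints, so every step passes through a single 3D representation.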

Experimental Results

Research Questions

  • RQ1: Can a single-stage diffusion model conditioned on text or a single image generate diverse, high-fidelity 3D assets without 3D supervision?
  • RQ2: Does a large transformer-based 3D reconstruction denoiser enable robust multi-view denoising and stable 3D reconstruction across diverse object categories?
  • RQ3: How do time conditioning and Plücker-ray camera conditioning affect diffusion-based 3D generation quality and stability?
  • RQ4: What is the impact of multi-view input count on 3D reconstruction quality and stability in a diffusion-based pipeline?
  • RQ5: How does DMV3D perform on single-image reconstruction and text-to-3D tasks compared to prior 3D diffusion approaches?

Key Findings

#Views      | FID ↓  | CLIP ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | CD ↓
4 (Ours)    | 35.16  | 0.888  | 21.798 | 0.852  | 0.150   | 0.0459
1           | 70.59  | 0.788  | 17.560 | 0.832  | 0.304   | 0.0775
2           | 47.69  | 0.896  | 20.965 | 0.851  | 0.167   | 0.0544
6           | 39.11  | 0.899  | 21.545 | 0.861  | 0.148   | 0.0454
w/o Novel   | 102.00 | 0.801  | 17.772 | 0.838  | 0.289   | 0.185
w/o Plücker | 43.31  | 0.883  | 20.930 | 0.842  | 0.185   | 0.0505

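
For readers less familiar with the PSNR column, the sketch below shows the standard definition: 10·log10(MAX² / MSE), in dB, where higher is better. The review does not say how the images were normalized before scoring, so the `[0, 1]` pixel range and the flat-list image format used here are assumptions for illustration.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat images.

    max_value is the largest possible pixel value (assumed 1.0 here).
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixel values.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Since PSNR is logarithmic in the mean squared error, a 1 dB difference corresponds to roughly a 21% lower MSE.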
  • Achieves fast 3D generation (~30s on a single A100 GPU) by integrating NeRF reconstruction into a 2D diffusion denoiser.
  • Outperforms prior 3D diffusion models on single-image reconstruction and text-to-3D benchmarks.
  • Demonstrates state-of-the-art results for single-image 3D reconstruction on ABO and GSO datasets (quantitative improvements across multiple metrics).
  • Produces diverse high-fidelity 3D assets from the same input image due to the probabilistic diffusion process.
  • Enables controllable 3D generation conditioned on text or images, with competitive quality.
  • Demonstrates robustness to out-of-domain inputs, attributed to training on both MVImgNet and Objaverse data and to camera conditioning via Plücker rays.
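
The Plücker-ray conditioning mentioned above can be illustrated with a small sketch. It assumes the common convention in which a ray with origin o and unit direction d is encoded as the 6-vector (o × d, d); the review does not give the exact formula, and the function names are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(origin, direction):
    """Return the 6-D Plücker coordinates (o x d, d) of a ray.

    The direction is normalized first so that the encoding does not depend
    on the length of the direction vector.
    """
    norm = sum(c * c for c in direction) ** 0.5
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    d = tuple(c / norm for c in direction)
    moment = cross(origin, d)  # unchanged if the origin slides along the ray
    return moment + d


# A ray through (1, 0, 0) pointing along +y.
print(plucker_ray((1.0, 0.0, 0.0), (0.0, 2.0, 0.0)))
```

Because the moment o × d stays the same when the origin moves along the ray, the encoding describes the line itself rather than one particular camera center, which may be one reason it suits varied camera setups.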

This review was generated by AI and checked by a human editor.