QUICK REVIEW

[Paper Review] Consistent123: One Image to Highly Consistent 3D Asset Using Case-Aware Diffusion Priors

Yukang Lin, Haonan Han|arXiv (Cornell University)|Sep 29, 2023

Advanced Vision and Imaging | Citations: 8

One-Line Summary

Consistent123 reconstructs highly 3D-consistent assets from a single image with a two-stage, case-aware diffusion-prior framework. It first relies on 3D priors alone, then blends in 2D texture priors with adaptively scheduled weights.

ABSTRACT

Reconstructing 3D objects from a single image guided by pretrained diffusion models has demonstrated promising outcomes. However, due to utilizing the case-agnostic rigid strategy, their generalization ability to arbitrary cases and the 3D consistency of reconstruction are still poor. In this work, we propose Consistent123, a case-aware two-stage method for highly consistent 3D asset reconstruction from one image with both 2D and 3D diffusion priors. In the first stage, Consistent123 utilizes only 3D structural priors for sufficient geometry exploitation, with a CLIP-based case-aware adaptive detection mechanism embedded within this process. In the second stage, 2D texture priors are introduced and progressively take on a dominant guiding role, delicately sculpting the details of the 3D model. Consistent123 aligns more closely with the evolving trends in guidance requirements, adaptively providing adequate 3D geometric initialization and suitable 2D texture refinement for different objects. Consistent123 can obtain highly 3D-consistent reconstruction and exhibits strong generalization ability across various objects. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art image-to-3D methods. See https://Consistent123.github.io for a more comprehensive exploration of our generated 3D assets.

Motivation and Objectives

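Advance robust single-image 3D reconstruction and achieve high 3D consistency across views.
Propose a two-stage pipeline that prioritizes 3D structural priors first and then integrates 2D texture priors.
Introduce case-aware, CLIP-based boundary detection to decide when to switch between the two stages.
Develop a dynamic-priors mechanism that gradually shifts guidance from 3D to 2D priors while preserving fidelity.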
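This review does not state the exact criterion for the stage switch. The sketch below is a hypothetical stand-in. It assumes the per-view CLIP similarities between renders and the input image are already computed, so no CLIP model is called here. The switch fires once the mean similarity stops improving by more than `min_delta` for `patience` consecutive checks. `PlateauSwitch`, `min_delta` and `patience` are illustrative names, not the paper's.

```python
from typing import Iterable


class PlateauSwitch:
    """Hypothetical Stage 1 -> Stage 2 trigger based on a CLIP-similarity plateau."""

    def __init__(self, min_delta: float = 0.01, patience: int = 2) -> None:
        self.min_delta = min_delta  # smallest gain in mean similarity that counts as progress
        self.patience = patience    # consecutive non-improving checks before switching
        self.best = float("-inf")
        self.stale = 0

    def update(self, view_scores: Iterable[float]) -> bool:
        """Feed CLIP similarities for the currently sampled views; True means switch."""
        scores = list(view_scores)
        if not scores:
            raise ValueError("need at least one view score")
        mean = sum(scores) / len(scores)
        if mean > self.best + self.min_delta:
            self.best = mean  # geometry is still improving
            self.stale = 0
        else:
            self.stale += 1   # no meaningful gain at this check
        return self.stale >= self.patience


sw = PlateauSwitch(min_delta=0.01, patience=2)
history = [[0.60, 0.62], [0.70, 0.72], [0.75, 0.77], [0.755, 0.77], [0.756, 0.77]]
print([sw.update(h) for h in history])  # → [False, False, False, False, True]
```

The mean scores in this history are 0.61, 0.71, 0.76, 0.7625 and 0.763. The last two fail to beat the best value by 0.01, so the switch fires at the fifth check. A per-object plateau like this adapts to each object, unlike a fixed iteration count.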

Proposed Method

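Stage 1 initializes the geometry using 3D structural priors only and uses case-aware boundary detection to determine the switching point.
Stage 2 introduces dynamic priors, which blend 3D and 2D priors with timestep-dependent exponential weighting and progressively emphasize 2D texture detail.
3D priors (Zero-1-to-3) are used for geometry initialization, and 2D priors (Stable Diffusion) are used for texture refinement via the SDS loss.
The stage switch is triggered by an adaptive, view-sampled CLIP-similarity metric that monitors reconstruction progress.
The final asset is represented as a mesh-based structure (via a DMTet/NeRF hybrid), which ensures 3D consistency and detailed textures.
The method implements a two-view framework (reference view and novel views) and applies RGB, mask, and depth losses at the reference view.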
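The reference-view supervision can be sketched as a weighted sum of three terms. This is a minimal NumPy sketch, and this review does not give the paper's exact loss forms. It makes three assumptions:

- The weights `w_rgb`, `w_mask` and `w_depth` are hypothetical.
- RGB error is measured only inside the reference mask.
- The depth term is 1 minus the Pearson correlation, one common scale-invariant choice when the reference depth comes from a monocular estimator.

```python
import numpy as np


def pearson_depth_loss(pred: np.ndarray, ref: np.ndarray) -> float:
    """1 - Pearson correlation: invariant to scale and shift of either depth map."""
    p = pred.ravel() - pred.mean()
    r = ref.ravel() - ref.mean()
    denom = np.linalg.norm(p) * np.linalg.norm(r)
    if denom == 0:
        return 1.0  # constant depth carries no correlation signal
    return float(1.0 - (p @ r) / denom)


def reference_view_loss(rgb, rgb_ref, mask, mask_ref, depth, depth_ref,
                        w_rgb=1.0, w_mask=1.0, w_depth=0.1) -> float:
    """Hypothetical reference-view loss: masked RGB MSE + mask MSE + depth term."""
    m = mask_ref[..., None]                       # (H, W, 1) broadcasts over RGB
    l_rgb = float(np.mean(((rgb - rgb_ref) * m) ** 2))
    l_mask = float(np.mean((mask - mask_ref) ** 2))
    fg = mask_ref > 0.5                           # compare depth on foreground only
    l_depth = pearson_depth_loss(depth[fg], depth_ref[fg])
    return w_rgb * l_rgb + w_mask * l_mask + w_depth * l_depth
```

The depth term ignores scale and shift. A reference depth of `2 * depth + 3` therefore contributes (numerically) zero loss, which suits monocular depth estimates whose absolute scale is unknown.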

Experimental Results

Research Questions

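RQ1: Can a case-aware, two-stage diffusion-prior framework achieve higher 3D consistency than state-of-the-art single-image methods?
RQ2: Does staged optimization (3D priors first, then dynamic 2D+3D priors) improve texture fidelity without sacrificing structural accuracy?
RQ3: Can adaptive, CLIP-driven boundary detection reliably decide, across object categories, when to switch from the 3D-prior stage to the dynamic-priors stage?
RQ4: Does the timestep-based dynamic prior (exponential blending) combine 3D and 2D priors better than linear or logarithmic blending?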
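RQ4 compares blending schedules. The sketch below shows what such schedules could look like and how a blend weight would enter an SDS-style update. It is a sketch under stated assumptions, since this review only names the schedule families:

- The normalized forms of the schedules are assumed.
- The sharpness `k` is assumed.
- The progress variable is taken to be the fraction of Stage 2 completed.
- Both priors predict noise for the same noisy latent. In that case, blending the two SDS gradients equals one SDS step with a blended noise prediction.

```python
import math

import numpy as np


def w2d(progress: float, schedule: str = "exp", k: float = 5.0) -> float:
    """Hypothetical weight of the 2D prior; rises from 0 to 1 as progress goes 0 -> 1."""
    s = min(max(progress, 0.0), 1.0)
    if schedule == "linear":
        return s
    if schedule == "exp":   # slow start, 2D prior takes over late
        return math.expm1(k * s) / math.expm1(k)
    if schedule == "log":   # fast start, 2D prior takes over early
        return math.log1p(k * s) / math.log1p(k)
    raise ValueError(f"unknown schedule: {schedule}")


def blended_sds_grad(eps_3d, eps_2d, noise, progress, schedule="exp", w_t=1.0):
    """SDS-style gradient w(t) * (eps_hat - eps), where eps_hat mixes the 3D-prior
    (e.g. Zero-1-to-3) and 2D-prior (e.g. Stable Diffusion) noise predictions."""
    lam = w2d(progress, schedule)
    eps_hat = (1.0 - lam) * np.asarray(eps_3d) + lam * np.asarray(eps_2d)
    return w_t * (eps_hat - np.asarray(noise))


for name in ("exp", "linear", "log"):
    print(name, round(w2d(0.5, name), 3))
```

With `k = 5` and progress at the midpoint, the schedules give the 2D prior different weights:

- exponential: about 0.08
- linear: 0.5
- logarithmic: about 0.70

So the exponential variant keeps 3D guidance dominant the longest, which matches the "3D first, 2D later" intent described above.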

Key Findings

Dataset	Method	CLIP-Similarity ↑	PSNR ↑	LPIPS ↓
RealFusion15	RealFusion	0.735	20.216	0.197
RealFusion15	Make-it-3D	0.839	20.010	0.119
RealFusion15	Zero-1-to-3	0.759	25.386	0.068
RealFusion15	Magic123	0.747	25.637	0.062
RealFusion15	Consistent123 (ours)	0.844	25.682	0.056
C10	RealFusion	0.680	22.355	0.140
C10	Make-it-3D	0.824	19.412	0.120
C10	Zero-1-to-3	0.700	18.292	0.229
C10	Magic123	0.751	15.538	0.197
C10	Consistent123 (ours)	0.770	25.327	0.054
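The dataset comparison in the table can be checked mechanically. The short script below picks the best method per metric and dataset, treating LPIPS as lower-is-better. Its values are transcribed from the table, with "Consistent123 (ours)" shortened to "Consistent123".

```python
# (CLIP-Similarity, PSNR, LPIPS) per method, transcribed from the table above.
RESULTS = {
    "RealFusion15": {
        "RealFusion": (0.735, 20.216, 0.197),
        "Make-it-3D": (0.839, 20.010, 0.119),
        "Zero-1-to-3": (0.759, 25.386, 0.068),
        "Magic123": (0.747, 25.637, 0.062),
        "Consistent123": (0.844, 25.682, 0.056),
    },
    "C10": {
        "RealFusion": (0.680, 22.355, 0.140),
        "Make-it-3D": (0.824, 19.412, 0.120),
        "Zero-1-to-3": (0.700, 18.292, 0.229),
        "Magic123": (0.751, 15.538, 0.197),
        "Consistent123": (0.770, 25.327, 0.054),
    },
}
METRICS = ("CLIP-Similarity", "PSNR", "LPIPS")
HIGHER_IS_BETTER = (True, True, False)  # LPIPS is a perceptual distance, so lower wins


def best_per_metric(rows):
    """Return {metric: best method} for one dataset."""
    best = {}
    for i, (metric, higher) in enumerate(zip(METRICS, HIGHER_IS_BETTER)):
        pick = max if higher else min
        best[metric] = pick(rows, key=lambda method: rows[method][i])
    return best


for dataset, rows in RESULTS.items():
    print(dataset, best_per_metric(rows))
```

For RealFusion15, the script reports Consistent123 as best on every metric. For C10, Consistent123 is best on PSNR and LPIPS, but Make-it-3D has the higher CLIP similarity (0.824 vs 0.770).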

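On RealFusion15, Consistent123 achieves the highest CLIP similarity on novel views (0.844), indicating the strongest 3D consistency among the compared methods. On C10, its CLIP similarity (0.770) ranks second, behind Make-it-3D (0.824).
The two-stage approach yields better texture and geometry than RealFusion, Make-it-3D, Zero-1-to-3, and Magic123 on most metrics: it is best on all three metrics on RealFusion15 and best on PSNR and LPIPS on C10.
Exponential dynamic-priors blending outperforms linear and logarithmic blending in CLIP similarity, PSNR, and LPIPS across categories.
The CLIP-based boundary detection effectively signals when the 3D structure has recovered sufficiently to transition to Stage 2.
Qualitative results show fewer multi-face (Janus) artifacts and improved texture continuity across views.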

This review was generated by AI and checked by a human editor.