QUICK REVIEW

[Paper Review] T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation

Yuze He, Yushi Bai|arXiv (Cornell University)|Oct 4, 2023

Human Motion and Animation|Cited by 11

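One-Line Summary

T$^3$Bench presents the first comprehensive automatic benchmark for text-to-3D generation. It pairs diverse prompts with multi-view quality and alignment metrics that correlate with human judgments, and uses them to evaluate 10 prominent methods.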

ABSTRACT

Recent methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRF. Notably, these methods are able to produce high-quality 3D scenes without training on 3D data. Due to the open-ended nature of the task, most studies evaluate their results with subjective case studies and user experiments, thereby presenting a challenge in quantitatively addressing the question: How has current progress in Text-to-3D gone so far? In this paper, we introduce T$^3$Bench, the first comprehensive text-to-3D benchmark containing diverse text prompts of three increasing complexity levels that are specially designed for 3D generation. To assess both the subjective quality and the text alignment, we propose two automatic metrics based on multi-view images produced by the 3D contents. The quality metric combines multi-view text-image scores and regional convolution to detect quality and view inconsistency. The alignment metric uses multi-view captioning and GPT-4 evaluation to measure text-3D consistency. Both metrics closely correlate with different dimensions of human judgments, providing a paradigm for efficiently evaluating text-to-3D models. The benchmarking results, shown in Fig. 1, reveal performance differences among an extensive 10 prevalent text-to-3D methods. Our analysis further highlights the common struggles for current methods on generating surroundings and multi-object scenes, as well as the bottleneck of leveraging 2D guidance for 3D generation. Our project page is available at: https://t3bench.com.

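Motivation and Objectives

Define a comprehensive, automatic benchmark for text-to-3D generation that reflects 3D geometry, view consistency, and text alignment.
Build three prompt sets of increasing complexity to probe current methods.
Propose and validate automatic metrics that use multi-view 2D renderings to assess quality and alignment with the prompt.
Unify the 3D representation as a mesh so that evaluation is consistent and methods can be compared fairly.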

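Proposed Method

Design prompts at three complexity levels, generated with GPT-4 and filtered for diversity using ROUGE-L.
Convert the various 3D outputs (NeRF-based) into a unified 3D mesh using DMTet or Marching Cubes for use in the benchmark.
Capture each 3D scene from 161 viewpoints on an icosahedron, using multi-focal capturing to select the best views for scoring.
Assess quality with multi-view CLIP and ImageReward scores combined with regional convolution, which detects view inconsistency (the Janus problem).
Assess alignment with 3D-to-text captioning (BLIP) across 12 icosahedron views, followed by GPT-4-based text recall evaluation (ROUGE-L).
Measure how well the metrics track human judgments by computing Spearman, Kendall, and Pearson correlations between metric scores and human scores.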
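The ROUGE-L diversity filtering of prompts mentioned in this list can be pictured with a minimal sketch. The review does not give the threshold, the tokenization, or the selection order, so the greedy strategy, the `threshold=0.5` default, whitespace tokenization, and the example prompts below are all illustrative assumptions rather than the authors' procedure.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two strings, using lowercase whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filter_diverse(prompts, threshold=0.5):
    """Greedily keep a prompt only if its ROUGE-L F1 against every
    already-kept prompt is below the (assumed) threshold."""
    kept = []
    for p in prompts:
        if all(rouge_l_f1(p, k) < threshold for k in kept):
            kept.append(p)
    return kept


# Hypothetical candidate prompts, not taken from the benchmark.
candidates = [
    "a red apple on a wooden table",
    "a red apple on a wooden desk",
    "a castle surrounded by a moat",
]
print(filter_diverse(candidates))
```

With these toy inputs the second prompt is dropped because it shares six of seven tokens in order with the first, while the third prompt is kept.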

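Experimental Results

Research Questions

RQ1: How do current text-to-3D methods perform on prompts that ask for a single object only, prompts that include surrounding context, and prompts that describe multi-object scenes?
RQ2: Can automatic multi-view quality and alignment metrics reliably reflect human judgments of 3D content quality and prompt fidelity?
RQ3: What are the main bottlenecks in moving from 2D guidance to consistent 3D scene generation, and how do different 3D representations affect benchmark results?
RQ4: To what extent do view-consistency problems (e.g., the Janus problem) affect quality and alignment scores across methods?

Key Findings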

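Method	Runtime	Single Object: Quality	Single Object: Alignment	Single Object: Average	Single Object with Surroundings: Quality	Single Object with Surroundings: Alignment	Single Object with Surroundings: Average	Multiple Objects: Quality	Multiple Objects: Alignment	Multiple Objects: Average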
Dreamfusion	30min	24.9	24.0	24.4	19.3	29.8	24.6	17.3	14.8	16.1
Magic3D	40min	38.7	35.3	37.0	29.8	41.0	35.4	26.6	24.8	25.7
LatentNeRF	65min	34.2	32.0	33.1	23.7	37.5	30.6	21.7	19.5	20.6
Fantasia3D	45min	29.2	23.5	26.4	21.9	32.0	27.0	22.7	14.3	18.5
SJC	25min	26.3	23.0	24.7	17.3	22.3	19.8	17.7	5.8	11.7
ProlificDreamer	240min	51.1	47.8	49.4	42.5	47.0	44.8	45.7	25.8	35.8
MVDream	30min	53.2	42.3	47.8	36.3	48.5	42.4	39.0	28.5	33.8
DreamGaussian	7min	19.9	19.8	19.8	10.4	17.8	14.1	12.3	9.5	10.9
GeoDream	400min	48.4	33.8	41.1	35.2	34.5	34.9	34.3	16.5	25.4
RichDreamer	70min	57.3	40.0	48.6	43.9	42.3	43.1	34.8	22.0	28.4
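As a quick check on how to read the table, the sketch below copies its numbers and tests two things: whether each average column matches the mean of the corresponding quality and alignment scores, and how the mean average changes across the three prompt sets. The averaging rule is inferred from the numbers rather than stated in this review, and the names `RESULTS`, `SETS`, and the helper functions are illustrative.

```python
# (quality, alignment, average) per prompt set, copied from the table above.
RESULTS = {
    "Dreamfusion":     [(24.9, 24.0, 24.4), (19.3, 29.8, 24.6), (17.3, 14.8, 16.1)],
    "Magic3D":         [(38.7, 35.3, 37.0), (29.8, 41.0, 35.4), (26.6, 24.8, 25.7)],
    "LatentNeRF":      [(34.2, 32.0, 33.1), (23.7, 37.5, 30.6), (21.7, 19.5, 20.6)],
    "Fantasia3D":      [(29.2, 23.5, 26.4), (21.9, 32.0, 27.0), (22.7, 14.3, 18.5)],
    "SJC":             [(26.3, 23.0, 24.7), (17.3, 22.3, 19.8), (17.7, 5.8, 11.7)],
    "ProlificDreamer": [(51.1, 47.8, 49.4), (42.5, 47.0, 44.8), (45.7, 25.8, 35.8)],
    "MVDream":         [(53.2, 42.3, 47.8), (36.3, 48.5, 42.4), (39.0, 28.5, 33.8)],
    "DreamGaussian":   [(19.9, 19.8, 19.8), (10.4, 17.8, 14.1), (12.3, 9.5, 10.9)],
    "GeoDream":        [(48.4, 33.8, 41.1), (35.2, 34.5, 34.9), (34.3, 16.5, 25.4)],
    "RichDreamer":     [(57.3, 40.0, 48.6), (43.9, 42.3, 43.1), (34.8, 22.0, 28.4)],
}
SETS = ["single object", "single object with surroundings", "multiple objects"]


def max_average_gap(results):
    """Largest |reported average - (quality + alignment) / 2| over all cells."""
    return max(
        abs(avg - (q + a) / 2)
        for rows in results.values()
        for q, a, avg in rows
    )


def mean_average_per_set(results):
    """Mean of the reported average column over all methods, per prompt set."""
    n = len(results)
    return [sum(rows[i][2] for rows in results.values()) / n for i in range(len(SETS))]


def best_method_per_set(results):
    """Method with the highest reported average in each prompt set."""
    return [max(results, key=lambda m: results[m][i][2]) for i in range(len(SETS))]


print("max gap:", round(max_average_gap(RESULTS), 3))
for name, mean, best in zip(SETS, mean_average_per_set(RESULTS), best_method_per_set(RESULTS)):
    print(f"{name}: mean average {mean:.2f}, best {best}")
```

Every reported average lies within rounding distance (0.05) of the mean of its two component scores, and the mean average falls from the single-object set to the multi-object set, although individual methods do not all decrease monotonically.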

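T$^3$Bench reveals strong variation among 10 prominent text-to-3D methods across the three prompt sets, with performance dropping as scene complexity increases.
Multi-view quality (via regional convolution) and multi-view alignment (via 3D captioning plus GPT-4 recall) correlate well with human judgments (Spearman/Kendall/Pearson >= 0.75).
View-consistency problems (the Janus problem) strongly affect quality scores, and regional convolution helps account for them.
The quality of 2D guidance from diffusion models often fails to reliably predict 3D generation quality, which highlights how hard it is to learn 3D structure from 2D cues.
VSD-based (variational score distillation) methods improve generation of complex scenes, but they may introduce excessive detail or fail to fully exploit 3D/multi-view priors, which can hurt alignment.
Geometry initialization and multi-view diffusion models are promising but struggle with out-of-distribution prompts and very complex scenes.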
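The agreement between an automatic metric and human ratings, reported in these findings as Spearman, Kendall, and Pearson correlations, can be computed as in the dependency-free sketch below. The score lists are hypothetical toy values, not data from the paper, and the Kendall variant shown is tau-a, which may differ from the variant the authors used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs.
    Tied pairs count as neither concordant nor discordant."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


# Hypothetical toy scores for five generated scenes (NOT from the paper).
metric_scores = [12.0, 35.5, 48.0, 20.0, 60.0]
human_scores = [1.5, 3.0, 4.5, 2.5, 4.0]

print("Pearson: ", round(pearson(metric_scores, human_scores), 3))
print("Spearman:", round(spearman(metric_scores, human_scores), 3))
print("Kendall: ", round(kendall_tau_a(metric_scores, human_scores), 3))
```

Rank-based coefficients such as Spearman and Kendall only depend on the ordering of the scenes, so they are less sensitive than Pearson to the scale on which human raters score.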

This review was generated by AI and checked by a human editor.