QUICK REVIEW

[Paper Digest] T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation

Yuze He, Yushi Bai|arXiv (Cornell University)|Oct 4, 2023

Human Motion and Animation | Cited by 11

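One-Sentence Summary

T$^3$Bench provides the first comprehensive automatic benchmark for text-to-3D generation, with diverse prompts and multi-view quality and alignment metrics that correlate with human judgment, and uses it to evaluate 10 prevalent methods.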

ABSTRACT

Recent methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRF. Notably, these methods are able to produce high-quality 3D scenes without training on 3D data. Due to the open-ended nature of the task, most studies evaluate their results with subjective case studies and user experiments, thereby presenting a challenge in quantitatively addressing the question: How has current progress in Text-to-3D gone so far? In this paper, we introduce T$^3$Bench, the first comprehensive text-to-3D benchmark containing diverse text prompts of three increasing complexity levels that are specially designed for 3D generation. To assess both the subjective quality and the text alignment, we propose two automatic metrics based on multi-view images produced by the 3D contents. The quality metric combines multi-view text-image scores and regional convolution to detect quality and view inconsistency. The alignment metric uses multi-view captioning and GPT-4 evaluation to measure text-3D consistency. Both metrics closely correlate with different dimensions of human judgments, providing a paradigm for efficiently evaluating text-to-3D models. The benchmarking results, shown in Fig. 1, reveal performance differences among an extensive 10 prevalent text-to-3D methods. Our analysis further highlights the common struggles for current methods on generating surroundings and multi-object scenes, as well as the bottleneck of leveraging 2D guidance for 3D generation. Our project page is available at: https://t3bench.com.

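Motivation and Objectives

Define a comprehensive, automatic benchmark for text-to-3D generation that reflects 3D geometry, view consistency, and text alignment.
Create three prompt sets of increasing complexity (single object, single object with surroundings, multiple objects) to probe current methods.
Propose and validate automatic metrics that use multi-view 2D renderings to assess quality and prompt alignment.
Unify 3D representations as meshes so that evaluation is consistent and comparisons across methods are fair.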

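Proposed Method

Design prompts at three complexity levels, generated with GPT-4 and filtered by ROUGE-L for diversity.
Convert the various (NeRF-based) 3D outputs into a unified mesh representation, via DMTet or Marching Cubes, for benchmarking.
Capture each 3D scene from 161 viewpoints on an icosahedron, using multi-focal capturing to pick the best views for scoring.
Assess quality with multi-view CLIP and ImageReward scores combined through regional convolution, which also detects view inconsistency (the Janus problem).
Assess alignment through 3D-to-text captioning (BLIP) over 12 icosahedron views, followed by GPT-4-based text recall evaluation (ROUGE-L).
Measure agreement between the metrics and human scores with Spearman, Kendall, and Pearson correlations.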
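
The regional-convolution idea in the quality metric can be illustrated with a minimal sketch. It is not the authors' implementation: the neighbour-mean smoothing, the `rounds` parameter, the max aggregation, and the toy ring of 8 views (standing in for the icosahedron viewpoints) are all assumptions made for illustration, and the per-view scores are invented numbers rather than real CLIP or ImageReward outputs.

```python
from statistics import mean


def ring_neighbors(n):
    """Hypothetical view adjacency: n views on a ring, each with two neighbours.

    A real setup would use the adjacency of the icosahedron viewpoints.
    """
    return [[(i - 1) % n, (i + 1) % n] for i in range(n)]


def regional_convolution(scores, neighbors, rounds=1):
    """Smooth per-view scores by averaging each view with its neighbours.

    Assumption: a plain neighbourhood mean, repeated `rounds` times.
    """
    current = list(scores)
    for _ in range(rounds):
        current = [
            mean([current[i]] + [current[j] for j in neighbors[i]])
            for i in range(len(current))
        ]
    return current


def quality_score(scores, neighbors, rounds=1):
    """Assumed aggregation: the best smoothed region decides the score."""
    return max(regional_convolution(scores, neighbors, rounds))


if __name__ == "__main__":
    views = ring_neighbors(8)
    # One flattering view surrounded by poor ones (view-inconsistent object).
    spiky = [0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]
    # Moderately good from every direction (view-consistent object).
    steady = [0.6] * 8
    print("raw max:     ", max(spiky), max(steady))
    print("regional max:", quality_score(spiky, views), quality_score(steady, views))
```

Under these assumptions, taking the raw maximum would rank the inconsistent object first, while the smoothed maximum favours the object that looks reasonable from neighbouring views as well.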

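Experimental Results

Research Questions

RQ1: How do current text-to-3D methods perform on prompts that involve a single object, a single object with surroundings, and multiple objects?
RQ2: Do the automatic multi-view quality and alignment metrics reliably reflect human judgments of 3D content quality and prompt fidelity?
RQ3: What are the main bottlenecks in moving from 2D guidance to consistent 3D scene generation, and how do different 3D representations affect benchmark results?
RQ4: To what extent do view-consistency issues (e.g., the Janus problem) affect the quality and alignment evaluation of each method?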

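Key Findings

T$^3$Bench reveals marked differences among 10 prevalent text-to-3D methods across the three prompt sets, with performance dropping as scene complexity increases.
Multi-view quality (via regional convolution) and multi-view alignment (via 3D captioning plus GPT-4 recall) correlate strongly with human judgments (Spearman/Kendall/Pearson >= 0.75).
View-consistency issues (the Janus problem) significantly affect quality scores; regional convolution helps mitigate this.
The quality of 2D guidance from diffusion models does not reliably predict 3D generation quality, underscoring the challenge of learning 3D structure from 2D cues.
VSD-based methods improve on complex scene generation but may introduce extra details or underuse 3D/multi-view priors, which hurts alignment.
Geometry initialization and multi-view diffusion models show promise but still underperform on out-of-distribution prompts or highly complex scenes.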
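
To make the correlation finding concrete, the sketch below computes Pearson, Spearman, and Kendall coefficients between an automatic metric and human ratings. The scores are hypothetical placeholders, not data from the paper, and the Kendall variant shown is the simple tau-a without tie correction; in practice a library such as SciPy (`scipy.stats.pearsonr`, `spearmanr`, `kendalltau`) would normally be used instead.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    # Hypothetical per-method scores, for illustration only.
    metric_scores = [2.1, 3.4, 3.9, 4.6, 5.0]
    human_scores = [1.5, 3.0, 2.5, 4.0, 4.5]
    print("Pearson: ", pearson(metric_scores, human_scores))
    print("Spearman:", spearman(metric_scores, human_scores))
    print("Kendall: ", kendall_tau_a(metric_scores, human_scores))
```

Rank-based coefficients such as Spearman and Kendall only ask whether the metric orders methods the way humans do, while Pearson also rewards a linear relationship between the two score scales, which is one reason to report all three.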

This digest was generated by AI and reviewed by a human editor.