QUICK REVIEW

[Paper Review] Scalable 3D Captioning with Pretrained Models

Tiange Luo, Chris Rockwell|arXiv (Cornell University)|Jun 12, 2023

Multimodal Machine Learning Applications|Cited by 20

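ONE-SENTENCE SUMMARY

Cap3D automatically generates descriptive captions for 3D objects by combining multi-view rendering with pretrained image captioning, image-text alignment, and large language models, yielding scalable 3D-text data and competitive text-to-3D performance.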

ABSTRACT

We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects. This approach utilizes pretrained models from image captioning, image-text alignment, and LLM to consolidate captions from multiple views of a 3D asset, completely side-stepping the time-consuming and costly process of manual annotation. We apply Cap3D to the recently introduced large-scale 3D dataset, Objaverse, resulting in 660k 3D-text pairs. Our evaluation, conducted using 41k human annotations from the same dataset, demonstrates that Cap3D surpasses human-authored descriptions in terms of quality, cost, and speed. Through effective prompt engineering, Cap3D rivals human performance in generating geometric descriptions on 17k collected annotations from the ABO dataset. Finally, we finetune Text-to-3D models on Cap3D and human captions, and show Cap3D outperforms; and benchmark the SOTA including Point-E, Shap-E, and DreamFusion.

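MOTIVATION AND GOALS

Address the scarcity and cost of high-quality 3D captions by leveraging large-scale image-text models.
Build a scalable pipeline that generates accurate multi-view captions for 3D assets.
Evaluate Cap3D on Objaverse to produce a large 3D-text dataset and compare it against human captions.
Use ABO to probe geometric captioning ability and explore prompt-based QA enhancement.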

PROPOSED METHOD

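Render each 3D object from multiple viewpoints (M=8 views) using Blender.
Generate N=5 captions per view with the BLIP2 image captioning model.
Filter the captions with CLIP image-text alignment, selecting the best-matching caption for each view.
Consolidate the selected per-view captions into a final caption, using GPT-4 to summarize and merge information across views.
Optionally apply two-stage QA prompting to emphasize fine-grained geometry (Cap3D QA).
Apply ethical filtering to the dataset by removing non-distributable assets (faces/NSFW) and applying language filtering.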
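
The selection and consolidation steps above can be sketched as follows. This is a minimal sketch, not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for the CLIP encoders, the candidate captions stand in for BLIP2 output, and the prompt wording in `build_consolidation_prompt` is an assumption rather than the prompt used in the paper. The demo uses 2 views with 2 candidates each for brevity; the pipeline described above uses M=8 views and N=5 captions per view (40 candidates per object).

```python
import math

# Hypothetical toy vocabulary; a real pipeline would use CLIP embeddings instead.
VOCAB = ["red", "chair", "table", "wooden"]


def toy_embed(text):
    """Stand-in text encoder (assumption): word counts over VOCAB."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_captions(view_embeddings, caption_sets, embed_text):
    """For each view, keep the candidate caption most similar to that view's image embedding."""
    selected = []
    for image_vec, captions in zip(view_embeddings, caption_sets):
        best = max(captions, key=lambda c: cosine(image_vec, embed_text(c)))
        selected.append(best)
    return selected


def build_consolidation_prompt(selected):
    """Assemble an LLM prompt from the per-view captions (wording is illustrative only)."""
    lines = [f"View {i + 1}: {c}" for i, c in enumerate(selected)]
    header = ("The following captions describe the same 3D object from different views. "
              "Write one concise description of the object.")
    return header + "\n" + "\n".join(lines)


# Toy image embeddings for two rendered views (made-up values in the VOCAB space).
view_embeddings = [[1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
caption_sets = [
    ["a red chair", "a wooden table"],
    ["a red table", "a wooden chair"],
]

selected = select_captions(view_embeddings, caption_sets, toy_embed)
prompt = build_consolidation_prompt(selected)
print(prompt)
```

Keeping one caption per view before calling the LLM means the consolidation prompt carries M captions rather than M×N, which is consistent with the token and cost reduction attributed to CLIP filtering in this review.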

Figure 1: Cap3D provides detailed descriptions of 3D objects by leveraging pretrained models in captioning, alignment, and LLM to consolidate multi-view information. Two views of 3D objects are shown here; Cap3D uses eight. Additional examples are available in Appendix B.

EXPERIMENTAL RESULTS

Research Questions

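RQ1: Can Cap3D produce high-quality multi-view captions at scale without manual annotation?
RQ2: How do single-view captions differ from LLM-consolidated multi-view captions in detail and accuracy?
RQ3: What are the trade-offs among caption quality, cost, and speed relative to crowdsourced 3D annotation?
RQ4: How well do Cap3D captions support downstream finetuning of text-to-3D models compared with human captions?
RQ5: Does QA prompting improve geometric detail on ABO-like datasets?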

Key Findings

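Cap3D captions outperform crowdsourced captions on Objaverse in quality, cost, and speed (human preference in A/B tests: Cap3D ~52% vs. 38%; cost per 1k annotations: Cap3D ~$8.35 vs. about $87.18 for humans; throughput: Cap3D ~65k objects/day vs. 1.4k for humans).
CLIP filtering (as used in Cap3D) reduces erroneous details and token usage, cutting cost from $15.33 to $4.18.
GPT-4 cross-view consolidation produces richer, more consistent object descriptions than single-view methods.
Finetuning state-of-the-art text-to-3D models (Point·E, Shap·E) on Cap3D captions improves several CLIP-based metrics and FID on Objaverse data, often surpassing the pretrained baselines.
Cap3D QA prompting brings geometry-focused descriptions on ABO close to human-level detail, outperforming standard automatic captioning.
Cap3D-generated captions enable scalable, data-efficient finetuning of text-to-3D models and yield a large 3D-text dataset built on Objaverse (660k pairs).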
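
As a rough check on the cost and speed figures above, the ratios can be worked out directly. This is a back-of-the-envelope sketch: the constants are the numbers quoted in this review (the Cap3D cost is assumed to be in USD per 1k annotations, like the human figure), and the "days to caption 660k objects" values are an extrapolation that assumes constant throughput, not results reported in the paper.

```python
# Figures quoted in the findings (Objaverse); units as stated in this review.
CAP3D_COST_PER_1K = 8.35     # assumed USD per 1k annotations
HUMAN_COST_PER_1K = 87.18    # USD per 1k annotations
CAP3D_PER_DAY = 65_000       # objects captioned per day
HUMAN_PER_DAY = 1_400        # objects captioned per day
DATASET_SIZE = 660_000       # 3D-text pairs on Objaverse
COST_WITHOUT_CLIP = 15.33    # cost before CLIP filtering, as quoted
COST_WITH_CLIP = 4.18        # cost after CLIP filtering, as quoted

cost_ratio = HUMAN_COST_PER_1K / CAP3D_COST_PER_1K    # how many times cheaper Cap3D is
speed_ratio = CAP3D_PER_DAY / HUMAN_PER_DAY           # how many times faster Cap3D is
cap3d_days = DATASET_SIZE / CAP3D_PER_DAY             # extrapolated, constant throughput
human_days = DATASET_SIZE / HUMAN_PER_DAY             # extrapolated, constant throughput
clip_saving = (COST_WITHOUT_CLIP - COST_WITH_CLIP) / COST_WITHOUT_CLIP

print(f"cost ratio:   {cost_ratio:.2f}x")
print(f"speed ratio:  {speed_ratio:.1f}x")
print(f"days for 660k objects: Cap3D {cap3d_days:.1f}, human {human_days:.0f}")
print(f"CLIP filtering saving: {clip_saving:.1%}")
```

Under these assumptions Cap3D is roughly 10x cheaper and 46x faster than crowdsourcing, and CLIP filtering removes about 73% of the quoted cost.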

Figure 2: Overview of Cap3D. Left to Right: (1) Render 3D objects from $M=8$ camera angles to capture object details; (2) Generate $N=5$ image captions per rendered image using BLIP2; (3) Select one caption for each image based on its similarity to the image encoding using CLIP; (4) Use GPT4 to cons


This review was generated by AI and reviewed by human editors.