QUICK REVIEW

[Paper Review] InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

Jiale Xu, Weihao Cheng|arXiv (Cornell University)|Apr 10, 2024

Computer Graphics and Visualization Techniques | Cited by 14

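One-Sentence Summary

InstantMesh is a fast, feed-forward pipeline that turns a single image into a 3D mesh. It combines a multi-view diffusion model with a sparse-view large reconstruction model (LRM) and differentiable iso-surface extraction, and reports state-of-the-art image-to-3D results in about 10 seconds.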

ABSTRACT

We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability. By synergizing the strengths of an off-the-shelf multiview diffusion model and a sparse-view reconstruction model based on the LRM architecture, InstantMesh is able to create diverse 3D assets within 10 seconds. To enhance the training efficiency and exploit more geometric supervisions, e.g., depths and normals, we integrate a differentiable iso-surface extraction module into our framework and directly optimize on the mesh representation. Experimental results on public datasets demonstrate that InstantMesh significantly outperforms other latest image-to-3D baselines, both qualitatively and quantitatively. We release all the code, weights, and demo of InstantMesh, with the intention that it can make substantial contributions to the community of 3D generative AI and empower both researchers and content creators.

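Motivation and Objectives

Enable fast, scalable single-image-to-3D mesh generation for a broad range of applications (VR/AR, design, gaming).
Improve generalization by exploiting open-world 3D priors through a large reconstruction model.
Integrate multi-view diffusion with mesh-oriented reconstruction and differentiable supervision (depth/normals) to improve geometry and texture.
Achieve training efficiency and scalability on large datasets through mesh-based supervision and staged training.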

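Proposed Method

Generate six consistent novel views from the single input image with a multi-view diffusion model.
Use a transformer-based sparse-view large reconstruction model (LRM) to predict a 3D mesh directly from the generated views.
Represent geometry as a mesh and integrate a differentiable iso-surface extraction module (FlexiCubes) for efficient surface extraction and supervision.
Train in two stages: Stage 1 trains on a triplane NeRF representation with image and mask losses; Stage 2 switches to the mesh representation and adds depth/normal supervision and regularization.
Fine-tune the diffusion model (Zero123++) to output six views on a white background, so that the downstream mesh reconstruction is stable.
Provide four model variants (NeRF/base, NeRF/large, Mesh/base, Mesh/large) and release the weights for practical use.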
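
The two-stage schedule described above can be illustrated with a minimal NumPy sketch of how the loss terms might be combined. This is an illustration under stated assumptions, not the authors' training code: the function names, the dictionary keys, and the weights w_mask, w_depth, and w_normal are hypothetical, and the regularization term and any perceptual image loss are left out for brevity.

```python
import numpy as np

def stage1_loss(pred_rgb, gt_rgb, pred_mask, gt_mask, w_mask=1.0):
    """Stage 1 (NeRF rendering): image and mask reconstruction terms.
    w_mask is a hypothetical weight, not a value from the paper."""
    image_term = np.mean((pred_rgb - gt_rgb) ** 2)
    mask_term = np.mean((pred_mask - gt_mask) ** 2)
    return image_term + w_mask * mask_term

def stage2_loss(pred, gt, w_mask=1.0, w_depth=0.5, w_normal=0.2):
    """Stage 2 (mesh rendering): Stage 1 terms plus masked depth and
    normal supervision. All weights are hypothetical placeholders."""
    base = stage1_loss(pred["rgb"], gt["rgb"], pred["mask"], gt["mask"], w_mask)
    m = gt["mask"]                      # (H, W), 1 inside the object
    denom = max(float(m.sum()), 1.0)    # avoid division by zero on empty masks
    # L1 depth error, averaged over foreground pixels only
    depth_term = np.sum(m * np.abs(pred["depth"] - gt["depth"])) / denom
    # 1 - cosine similarity between unit normals, foreground only
    cos = np.sum(pred["normal"] * gt["normal"], axis=-1)
    normal_term = np.sum(m * (1.0 - cos)) / denom
    return base + w_depth * depth_term + w_normal * normal_term
```

In this sketch Stage 2 reuses the Stage 1 terms, so switching representation only adds the masked depth and normal terms on top of what was already being optimized.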

Figure 2 : The overview of our InstantMesh framework. Given an input image, we first utilize a multi-view diffusion model to synthesize 6 novel views at fixed camera poses. Then we feed the generated multi-view images into a transformer-based sparse-view large reconstruction model to reconstruct a h

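Experimental Results

Research Questions

RQ1: Can a single input image be turned into a high-quality 3D mesh within seconds by combining diffusion-based multi-view generation with a sparse-view LRM?
RQ2: Does integrating differentiable iso-surface extraction and direct mesh supervision improve geometry and texture over triplane/NeRF-based methods?
RQ3: Under multi-view supervision, how do mesh-based and NeRF-based reconstruction differ in 2D view quality and 3D geometric accuracy?
RQ4: How do the number of input views and the training strategy affect scalability and generalization to open-world objects?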

Key Findings

Method	PSNR ↑	SSIM ↑	LPIPS ↓	CD ↓	FS ↑
TripoSR	23.373	0.868	0.213	0.217	0.843
LGM	21.538	0.871	0.216	0.345	0.671
CRM	22.195	0.891	0.150	0.252	0.787
SV3D	22.098	0.861	0.201	-	-
Ours (NeRF)	23.141	0.898	0.119	0.177	0.882
Ours (Mesh)	22.794	0.897	0.120	0.180	0.880

Table 3
Method	PSNR ↑	SSIM ↑	LPIPS ↓	CD ↓	FS ↑
TripoSR	21.996	0.877	0.198	0.245	0.811
LGM	20.434	0.864	0.226	0.382	0.635
CRM	21.630	0.892	0.147	0.246	0.802
SV3D	21.510	0.866	0.186	-	-
Ours (NeRF)	22.635	0.903	0.110	0.199	0.869
Ours (Mesh)	21.954	0.901	0.112	0.203	0.864

Table 4
Method	PSNR ↑	SSIM ↑	LPIPS ↓	CD ↓	FS ↑
TripoSR	19.977	0.859	0.206	0.221	0.847
LGM	18.665	0.832	0.250	0.356	0.653
CRM	19.422	0.865	0.172	0.274	0.778
SV3D	20.294	0.853	0.176	-	-
Ours (NeRF)	19.752	0.869	0.150	0.206	0.863
Ours (Mesh)	19.552	0.868	0.150	0.204	0.866
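
To make the PSNR column easier to read, here is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). It assumes images stored as float arrays in [0, 1]; the function name and the max_val parameter are illustrative and are not taken from the paper's evaluation code.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Because the scale is logarithmic, a gap of 1 dB corresponds to roughly 21% lower mean squared error, so differences of a few tenths of a dB between methods are small in absolute terms.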

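InstantMesh reaches state-of-the-art image-to-3D performance on public datasets, outperforming the baselines in 2D novel-view quality (SSIM, LPIPS) and 3D geometry (Chamfer distance, F-Score); its PSNR is competitive but not always the highest.
The mesh variant with FlexiCubes yields smoother surfaces and allows stronger geometric supervision than the triplane-based NeRF variant.
The framework generates diverse, high-quality 3D assets from a single image in about 10 seconds.
Four model variants (NeRF/base, NeRF/large, Mesh/base, Mesh/large) are released with weights to suit different application needs.
The LRM-based architecture and mesh-oriented supervision strategy support training on large to very large datasets.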
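
The two geometry metrics reported above, Chamfer distance (CD) and F-Score (FS), compare point clouds sampled from the predicted and ground-truth surfaces. The sketch below shows one common formulation using brute-force nearest neighbours in NumPy. Conventions vary between papers (squared or unsquared distances, sum or average of the two directions), so this should be read as an illustration rather than the exact evaluation protocol; the function names and the default threshold are assumptions.

```python
import numpy as np

def nearest_distances(a, b):
    """For each point in a (N, 3), the Euclidean distance to its
    nearest neighbour in b (M, 3). Brute force, O(N * M) memory."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance: mean nearest-neighbour distance
    in both directions, summed. Lower is better."""
    return nearest_distances(pred, gt).mean() + nearest_distances(gt, pred).mean()

def f_score(pred, gt, threshold=0.2):
    """Harmonic mean of precision and recall at a distance threshold
    (hypothetical default). Higher is better."""
    precision = (nearest_distances(pred, gt) < threshold).mean()
    recall = (nearest_distances(gt, pred) < threshold).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

CD penalizes average deviation, while FS counts how many points fall within a tolerance, so a method can rank differently on the two when its errors are concentrated in a few regions.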

Figure 3 : The 3D meshes generated by InstantMesh demonstrate significantly better geometry and texture compared to the other baselines. The results of InstantMesh are rendered at a fixed elevation of $20^{\circ}$ , while the results of other methods are rendered at a fixed elevation of $0^{\circ}$


This review was generated by AI and reviewed by a human editor.