QUICK REVIEW

[Paper Review] Shap-E: Generating Conditional 3D Implicit Functions

Heewoo Jun, Alex Nichol | arXiv (Cornell University) | May 3, 2023
Generative Adversarial Networks and Image Synthesis | 112 citations
TL;DR

Shap-E trains a two-stage model that encodes 3D assets into implicit function parameters and then learns a conditional diffusion prior to generate diverse text- or image-conditioned 3D assets that can render as NeRFs or textured meshes. It achieves faster convergence and competitive sample quality compared to Point-E while enabling multi-representation outputs.

ABSTRACT

We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space. We release model weights, inference code, and samples at https://github.com/openai/shap-e.
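
To make the idea of "generating the parameters of an implicit function" concrete, here is a minimal sketch in pure Python. It assumes the simplest possible mapping, a plain reshape of a flat latent into weight rows with no learned projection, and a toy one-hidden-layer MLP with a scalar output. The function names, sizes, and activation are hypothetical and are not taken from the Shap-E code.

```python
import math
from typing import List, Sequence


def latent_to_weight_rows(latent: Sequence[float], row_len: int) -> List[List[float]]:
    """Split a flat latent into rows, each used as one weight row of the MLP.

    Assumption: a plain reshape. A real model may apply a learned projection.
    """
    if len(latent) % row_len != 0:
        raise ValueError("latent length must be a multiple of row_len")
    return [list(latent[i:i + row_len]) for i in range(0, len(latent), row_len)]


def implicit_mlp(point: Sequence[float],
                 hidden_rows: Sequence[Sequence[float]],
                 out_row: Sequence[float]) -> float:
    """Evaluate a toy implicit function at a 3D point.

    Each hidden unit is tanh(dot(weight_row, point)); the output is a
    weighted sum of the hidden units (think: a density or signed distance).
    """
    hidden = [math.tanh(sum(w * p for w, p in zip(row, point))) for row in hidden_rows]
    return sum(w * h for w, h in zip(out_row, hidden))


# A 6-number "latent" becomes two weight rows of length 3 (one per hidden unit).
rows = latent_to_weight_rows([1.0, 0.0, 0.0, 0.0, 1.0, 0.0], row_len=3)
value = implicit_mlp((1.0, 0.0, 0.0), rows, out_row=[1.0, 1.0])
```

The point of the sketch is that the generated object is the function itself: once the weights are fixed, the same function can be queried at arbitrary 3D points, which is what allows one latent to feed more than one renderer.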

Motivation & Objective

  • Motivate generation of conditional 3D assets expressed as implicit functions rather than fixed representations.
  • Develop a scalable encoder-diffusion framework that maps 3D assets to implicit function parameters.
  • Train a diffusion prior on encoder outputs conditioned on text or images to enable text- and image-conditioned 3D generation.
  • Demonstrate that implicit representations can achieve comparable or better sample quality with faster inference than explicit point-cloud baselines.

Proposed method

  • Train a Transformer-based encoder that maps dense 3D representations (point clouds and rendered views) to the parameters of an MLP implicit function, which serves as both a NeRF and an STF (signed distance and texture field).
  • Pre-train the encoder with NeRF rendering objectives, then extend with SDF and texture heads and stabilize with distillation before fine-tuning.
  • Train a diffusion prior on encoder outputs (latent vectors) conditioned on text or images, using classifier-free guidance during sampling.
  • Use latent diffusion with sequences of latent vectors corresponding to MLP weight rows, enabling high-dimensional implicit representations.
  • Render outputs either as NeRFs, via differentiable volumetric rendering, or as textured meshes, via the STF branch and marching cubes, with end-to-end fine-tuning for STF outputs.
  • Adopt latent diffusion training and sampling strategies analogous to Point-E, with direct x0 prediction and guidance scales for conditioning.
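
The combination of classifier-free guidance and direct x0 prediction mentioned in the list above can be sketched as follows. This is a generic illustration of the technique, not the Shap-E implementation: `guided_x0` and `toy_denoiser` are hypothetical names, and the toy denoiser is a stand-in for a trained network that predicts the clean latent x0 from a noisy latent.

```python
from typing import Callable, List, Optional

Vector = List[float]
Denoiser = Callable[[Vector, float, Optional[float]], Vector]


def guided_x0(denoiser: Denoiser, x_t: Vector, t: float,
              cond: Optional[float], guidance_scale: float) -> Vector:
    """Classifier-free guidance on an x0-predicting model.

    Runs the model with and without the condition, then extrapolates:
        x0 = x0_uncond + scale * (x0_cond - x0_uncond)
    scale = 0 gives the unconditional prediction, scale = 1 the conditional one,
    and scale > 1 pushes further toward the condition.
    """
    x0_cond = denoiser(x_t, t, cond)
    x0_uncond = denoiser(x_t, t, None)  # None = condition dropped
    return [u + guidance_scale * (c - u) for c, u in zip(x0_cond, x0_uncond)]


def toy_denoiser(x_t: Vector, t: float, cond: Optional[float]) -> Vector:
    """Hypothetical stand-in for a trained model: pulls x_t halfway to a target."""
    target = 0.0 if cond is None else float(cond)
    return [0.5 * x + 0.5 * target for x in x_t]


x_t = [0.0, 2.0]
print(guided_x0(toy_denoiser, x_t, t=0.5, cond=1.0, guidance_scale=3.0))
```

In a real sampler this guided prediction would be computed at every denoising step and passed to the update rule. The guidance scale trades diversity for fidelity to the prompt.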

Experimental results

Research questions

  • RQ1: Can a diffusion model conditioned on text or images generate diverse, high-quality 3D assets encoded as implicit functions?
  • RQ2: Does predicting implicit MLP weights directly in a latent diffusion space yield competitive results versus explicit 3D representations like point clouds?
  • RQ3: How does the Shap-E approach scale in speed and sample quality relative to prior 3D generative models (e.g., Point-E) when conditioned on text or images?
  • RQ4: What are the trade-offs between NeRF rendering and STF (texture/mesh) rendering in the context of a unified implicit representation?

Key findings

  • Shap-E achieves faster convergence and comparable or superior sample quality to Point-E on several metrics.
  • Text-conditioned Shap-E improves CLIP-based metrics over the comparable Point-E model, though some overfitting is observed at later training stages.
  • Shap-E enables both NeRF and textured mesh renderings from the same implicit-function representation.
  • With large-scale data, Shap-E produces diverse, recognizable 3D assets conditioned on text or image prompts.
  • Inference latency is significantly lower than optimization-based 3D generation approaches, and faster than some prior diffusion-based 3D methods.
  • Qualitative analysis reveals shared success/failure patterns between Shap-E and Point-E under image conditioning, but notable differences emerge with text conditioning.
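
The findings above refer to CLIP-based metrics without naming them. One widely used metric of this kind is CLIP R-Precision, and the sketch below assumes that is the sort of metric meant. It scores how often a render's embedding is closest to the caption that produced it. The embeddings here are toy 2D vectors, not real CLIP outputs, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def r_precision_at_1(render_embs: List[Sequence[float]],
                     caption_embs: List[Sequence[float]]) -> float:
    """Fraction of renders whose most similar caption is their own.

    Assumes render_embs[i] was generated from the prompt behind caption_embs[i].
    """
    hits = 0
    for i, render in enumerate(render_embs):
        sims = [cosine(render, cap) for cap in caption_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(render_embs)


# Toy data: the third render matches the first caption better than its own.
captions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
renders = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]
score = r_precision_at_1(renders, captions)
```

In practice the embeddings would come from a CLIP image encoder applied to rendered views and a CLIP text encoder applied to the prompts, and the score would be averaged over many prompts.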

This review was created by AI and reviewed by human editors.