[Paper Review] Shap-E: Generating Conditional 3D Implicit Functions
Shap-E is trained in two stages: an encoder first maps 3D assets to the parameters of an implicit function, and a conditional diffusion prior is then trained on those parameters to generate diverse text- or image-conditioned 3D assets that can be rendered as NeRFs or textured meshes. It converges faster than Point-E and reaches competitive sample quality while supporting multiple output representations.
We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space. We release model weights, inference code, and samples at https://github.com/openai/shap-e.
Motivation & Objective
- Motivate generation of conditional 3D assets expressed as implicit functions rather than fixed representations.
- Develop a scalable encoder-diffusion framework that maps 3D assets to implicit function parameters.
- Train a diffusion prior on encoder outputs conditioned on text or images to enable text- and image-conditioned 3D generation.
- Demonstrate that implicit representations can achieve comparable or better sample quality with faster inference than explicit point-cloud baselines.
Proposed method
- Train a Transformer-based encoder that maps dense 3D representations (point clouds and rendered views) to the parameters of an MLP that serves as both a NeRF and an STF (signed distance function and texture field).
- Pre-train the encoder with NeRF rendering objectives, then add SDF and texture output heads, train them by distillation for stability, and fine-tune end to end.
- Train a diffusion prior on encoder outputs (latent vectors) conditioned on text or images, using classifier-free guidance during sampling.
- Use latent diffusion with sequences of latent vectors corresponding to MLP weight rows, enabling high-dimensional implicit representations.
- Render the same implicit function either as a NeRF or as a textured mesh: the STF output is converted to a mesh with marching cubes and rendered differentiably, with end-to-end fine-tuning for STF outputs.
- Adopt latent diffusion training and sampling strategies analogous to Point-E, with direct x0 prediction and guidance scales for conditioning.
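The guidance step mentioned in the list above can be illustrated with a minimal sketch. It assumes a denoiser that predicts the clean latent x0 directly and blends its conditional and unconditional predictions with a guidance scale. The names `guided_x0` and `toy_denoiser` are hypothetical and are not taken from the Shap-E codebase; the toy denoiser only stands in for a trained model so the arithmetic can be checked by hand.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical denoiser signature (an assumption for illustration):
# (noisy latent x_t, timestep t, optional condition) -> predicted clean latent x0
Denoiser = Callable[[Sequence[float], int, Optional[str]], Vector]


def guided_x0(
    denoiser: Denoiser,
    x_t: Sequence[float],
    t: int,
    cond: str,
    guidance_scale: float,
) -> Vector:
    """Classifier-free guidance applied to x0 predictions.

    Returns uncond + scale * (cond - uncond), element by element.
    A scale of 1.0 reproduces the conditional prediction; larger
    scales push the sample further toward the condition.
    """
    x0_cond = denoiser(x_t, t, cond)      # prediction with the prompt
    x0_uncond = denoiser(x_t, t, None)    # prediction with the prompt dropped
    return [u + guidance_scale * (c - u) for c, u in zip(x0_cond, x0_uncond)]


def toy_denoiser(x_t: Sequence[float], t: int, cond: Optional[str]) -> Vector:
    """Stand-in for a trained model: halves the input and adds an offset
    of 1.0 only when a condition is supplied. Ignores the timestep."""
    offset = 1.0 if cond is not None else 0.0
    return [0.5 * v + offset for v in x_t]


if __name__ == "__main__":
    print(guided_x0(toy_denoiser, [2.0, 4.0], t=0, cond="a chair", guidance_scale=3.0))
```

Because the conditional and unconditional predictions share the same noisy input, guiding in x0 space is equivalent to guiding the corresponding noise predictions. In a real sampler this blend would be computed at every denoising step.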
Experimental results
Research questions
- RQ1: Can a diffusion model conditioned on text or images generate diverse, high-quality 3D assets encoded as implicit functions?
- RQ2: Does predicting implicit MLP weights directly in a latent diffusion space yield competitive results versus explicit 3D representations like point clouds?
- RQ3: How does the Shap-E approach scale in speed and sample quality relative to prior 3D generative models (e.g., Point-E) when conditioned on text or images?
- RQ4: What are the trade-offs between NeRF rendering and STF (texture/mesh) rendering in the context of a unified implicit representation?
Key findings
- Shap-E achieves faster convergence and comparable or superior sample quality to Point-E on several metrics.
- Text-conditioned Shap-E improves CLIP-based metrics over the comparable Point-E model, though some overfitting is observed at later training stages.
- Shap-E enables both NeRF and textured mesh renderings from the same implicit-function representation.
- With large-scale data, Shap-E produces diverse, recognizable 3D assets conditioned on text or image prompts.
- Inference latency is significantly lower than optimization-based 3D generation approaches, and faster than some prior diffusion-based 3D methods.
- Qualitative analysis reveals shared success/failure patterns between Shap-E and Point-E under image conditioning, but notable differences emerge with text conditioning.
This review was created by AI and reviewed by human editors.