[Paper Review] ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
ELITE trains a learning-based encoder with global and local mapping networks to convert visual concepts into textual embeddings, enabling fast, accurate, and editable customized text-to-image generation using pre-trained diffusion models.
In addition to their unprecedented ability for imaginative creation, large text-to-image models are expected to handle customized concepts in image generation. Existing works generally learn such concepts in an optimization-based manner, which brings an excessive computation or memory burden. In this paper, we instead propose a learning-based encoder, consisting of global and local mapping networks, for fast and accurate customized text-to-image generation. Specifically, the global mapping network projects the hierarchical features of a given image into multiple new words in the textual word embedding space, i.e., one primary word for the well-editable concept and other auxiliary words to exclude irrelevant disturbances (e.g., background). Meanwhile, a local mapping network injects the encoded patch features into cross-attention layers to provide omitted details without sacrificing the editability of primary concepts. We compare our method with existing optimization-based approaches on a variety of user-defined concepts, and demonstrate that our method enables high-fidelity inversion and more robust editability with a significantly faster encoding process. Our code is publicly available at https://github.com/csyxwei/ELITE.
Motivation & Objective
- Enable fast and accurate customized text-to-image generation from a given concept image.
- Replace optimization-based concept learning with a learning-based encoder.
- Leverage multi-layer CLIP features to create a robust, editable primary concept word.
- Incorporate a local mapping network to inject detailed, location-specific information without losing editability.
- Demonstrate superior speed and competitive fidelity/editability against existing methods.
Proposed method
- Use a pre-trained CLIP image encoder to extract hierarchical features from the concept image.
- Train a global mapping network to produce multiple word embeddings from CLIP features, forming a primary word and auxiliary words for disturbances.
- Train a local mapping network to encode foreground details into textual feature space and inject via cross-attention to preserve local details.
- Attach the global and local embeddings to Stable Diffusion through cross-attention projections to guide generation, using only the primary word for editing.
- Optimize with a combination of diffusion loss and L1 regularization on the embeddings: $\mathcal{L}_{global} = \mathcal{L}_{LDM} + \lambda_{global} \|v\|_1$ and $\mathcal{L}_{local} = \mathcal{L}_{LDM} + \lambda_{local} \|V^l\|_1$.
- During inference, generate concepts by using the primary word w0 and optionally fuse local details for fidelity.
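The two training objectives in the list above share one shape: a diffusion loss plus an L1 penalty on the predicted embeddings. The following is a minimal sketch of that shape in plain Python, not the authors' implementation. The function names, the mean-squared-error stand-in for L_LDM, and the value of `lam` are all assumptions made for illustration; the review does not state the actual weights.

```python
def l1_norm(embeddings):
    """Sum of absolute values over a list of embedding vectors (||v||_1)."""
    return sum(abs(x) for vec in embeddings for x in vec)


def diffusion_loss(noise_pred, noise_true):
    """Mean squared error between predicted and true noise.

    Assumption: used here as a simple stand-in for L_LDM.
    """
    return sum((p - t) ** 2 for p, t in zip(noise_pred, noise_true)) / len(noise_true)


def regularized_loss(noise_pred, noise_true, embeddings, lam):
    """L = L_LDM + lam * ||embeddings||_1, the shape shared by L_global and L_local."""
    return diffusion_loss(noise_pred, noise_true) + lam * l1_norm(embeddings)


# Toy values; lam = 0.01 is a placeholder, not a value taken from the review.
loss = regularized_loss(
    noise_pred=[1.0, 2.0],
    noise_true=[0.0, 0.0],
    embeddings=[[1.0, -2.0], [0.5, 0.0]],
    lam=0.01,
)
```

The same function covers both objectives: pass the word embeddings v with lambda_global for the global network, or the local features V^l with lambda_local for the local network.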
Experimental results
Research questions
- RQ1: Can an encoder learn to map visual concepts to editable textual embeddings faster than optimization-based methods?
- RQ2: Does a multi-layer, multi-word global mapping improve editability and fidelity over single-word embeddings?
- RQ3: Can a local mapping network inject fine-grained details without harming the ability to edit the primary concept?
- RQ4: How does ELITE compare to existing methods in terms of speed, text alignment, and image alignment?
Key findings
| Method | CLIP-T (↑) | CLIP-I (↑) | DINO-I (↑) | Time (↓) |
|---|---|---|---|---|
| Textual Inversion [15] | 0.183 | 0.663 | 0.462 | 50 min |
| DreamBooth [33] | 0.251 | 0.785 | 0.674 | 15 min |
| Custom Diffusion [18] | 0.245 | 0.801 | 0.695 | 6 min |
| Ours | 0.255 | 0.762 | 0.652 | 0.05 s |
- ELITE achieves fast concept encoding, finishing in about 0.05 seconds, versus minutes for optimization-based methods.
- Using multi-layer, multi-word global mapping yields a more editable primary word and better concept fidelity than single-layer or single-word variants.
- Incorporating a local mapping network improves local detail consistency with modest impact on editability.
- ELITE shows competitive text alignment and image alignment while delivering substantially faster encoding times compared with Textual Inversion, DreamBooth, and Custom Diffusion.
- User studies indicate strong preference for ELITE in editing alignment and overall satisfaction, while maintaining comparable image-level fidelity to competing methods.
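To put the encoding times from the table on a common scale, the short sketch below converts each baseline to seconds and divides by the 0.05 s reported for ELITE. The times are the rounded figures from the table, so the ratios are order-of-magnitude estimates only; the variable and function names are illustrative.

```python
# Encoding times from the results table, converted to seconds.
baseline_seconds = {
    "Textual Inversion": 50 * 60,
    "DreamBooth": 15 * 60,
    "Custom Diffusion": 6 * 60,
}
ELITE_SECONDS = 0.05


def speedup(baseline, elite=ELITE_SECONDS):
    """Ratio of a baseline's encoding time to ELITE's encoding time."""
    return baseline / elite


# Round to whole numbers, since the inputs are themselves rounded.
speedups = {name: round(speedup(sec)) for name, sec in baseline_seconds.items()}

for name, factor in speedups.items():
    print(f"{name}: about {factor}x slower than ELITE")
```

On these figures the gap ranges from roughly 7,200x (Custom Diffusion) to roughly 60,000x (Textual Inversion), which is the basis for the "seconds versus minutes" framing in the findings.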
This review was created by AI and reviewed by human editors.