QUICK REVIEW

[Paper Review] ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation

Yuxiang Wei, Yabo Zhang|arXiv (Cornell University)|Feb 27, 2023

Video Analysis and Summarization|7 citations

TL;DR

ELITE trains a learning-based encoder with global and local mapping networks to convert visual concepts into textual embeddings, enabling fast, accurate, and editable customized text-to-image generation using pre-trained diffusion models.

ABSTRACT

In addition to the unprecedented ability in imaginary creation, large text-to-image models are expected to take customized concepts in image generation. Existing works generally learn such concepts in an optimization-based manner, yet bringing excessive computation or memory burden. In this paper, we instead propose a learning-based encoder, which consists of a global and a local mapping networks for fast and accurate customized text-to-image generation. In specific, the global mapping network projects the hierarchical features of a given image into multiple new words in the textual word embedding space, i.e., one primary word for well-editable concept and other auxiliary words to exclude irrelevant disturbances (e.g., background). In the meantime, a local mapping network injects the encoded patch features into cross attention layers to provide omitted details, without sacrificing the editability of primary concepts. We compare our method with existing optimization-based approaches on a variety of user-defined concepts, and demonstrate that our method enables high-fidelity inversion and more robust editability with a significantly faster encoding process. Our code is publicly available at https://github.com/csyxwei/ELITE.

Motivation & Objective

Enable fast and accurate customized text-to-image generation from a small set of concept images.
Replace optimization-based concept learning with a learning-based encoder.
Leverage multi-layer CLIP features to create a robust, editable primary concept word.
Incorporate a local mapping network to inject detailed, location-specific information without losing editability.
Demonstrate superior speed and competitive fidelity/editability against existing methods.

Proposed method

Use a pre-trained CLIP image encoder to extract hierarchical features from the concept image.
Train a global mapping network to produce multiple word embeddings from CLIP features, forming a primary word and auxiliary words for disturbances.
Train a local mapping network to encode foreground details into textual feature space and inject via cross-attention to preserve local details.
Attach the global and local embeddings to Stable Diffusion through cross-attention projections to guide generation, using only the primary word for editing.
Optimize with a combination of diffusion loss and L1 regularization on embeddings (L_global = L_LDM + lambda_global ||v||_1; L_local = L_LDM + lambda_local ||V^l||_1).
During inference, generate concepts by using the primary word w0 and optionally fuse local details for fidelity.
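The two training objectives listed above share one shape: a diffusion loss plus a weighted L1 penalty on the learned embeddings. The following is a minimal sketch of that shape only, not the authors' implementation. The function names, the toy embedding values, and the weight 0.01 are illustrative assumptions; the review does not state the values of lambda_global or lambda_local.

```python
def l1_norm(values):
    """Return the L1 norm (sum of absolute values) of a flat list of floats."""
    return sum(abs(x) for x in values)


def regularized_loss(diffusion_loss, embeddings, reg_weight):
    """Compute L = L_LDM + lambda * ||v||_1.

    diffusion_loss: scalar standing in for L_LDM (assumed already computed).
    embeddings: list of word embeddings, each a list of floats.
    reg_weight: the regularization weight lambda (hypothetical value here).
    """
    flat = [x for word in embeddings for x in word]
    return diffusion_loss + reg_weight * l1_norm(flat)


# Hypothetical toy values: two 3-dimensional word embeddings.
toy_embeddings = [[0.5, -0.25, 0.0], [1.0, 0.0, -0.25]]
toy_loss = regularized_loss(diffusion_loss=0.5, embeddings=toy_embeddings, reg_weight=0.01)
print(round(toy_loss, 6))
```

The same function covers both objectives: pass the global word embeddings v for L_global, or the local features V^l for L_local, each with its own weight.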

Experimental results

Research questions

RQ1: Can an encoder learn to map visual concepts to editable textual embeddings faster than optimization-based methods?
RQ2: Does a multi-layer, multi-word global mapping improve editability and fidelity over single-word embeddings?
RQ3: Can a local mapping network inject fine-grained details without harming the ability to edit the primary concept?
RQ4: How does ELITE compare to existing methods in terms of speed, text alignment, and image alignment?

Key findings

Method	CLIP-T (↑)	CLIP-I (↑)	DINO-I (↑)	Time (↓)
Textual Inversion [15]	0.183	0.663	0.462	50 min
DreamBooth [33]	0.251	0.785	0.674	15 min
Custom Diffusion [18]	0.245	0.801	0.695	6 min
Ours	0.255	0.762	0.652	0.05 s
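The alignment columns in the table are similarity scores. The review does not define them, so this sketch assumes the common convention: CLIP-T is the cosine similarity between CLIP embeddings of a generated image and its text prompt, and CLIP-I and DINO-I are cosine similarities between generated and reference image embeddings under CLIP and DINO features. The embeddings are assumed to be precomputed, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_alignment(generated_embeddings, reference_embedding):
    """Average similarity of several generated-image embeddings to one reference.

    The reference is a text embedding for a CLIP-T-style score, or an image
    embedding for a CLIP-I / DINO-I-style score (assumed convention).
    """
    scores = [cosine_similarity(g, reference_embedding) for g in generated_embeddings]
    return sum(scores) / len(scores)
```

Higher values mean the generated images sit closer to the prompt or to the reference concept in the chosen feature space, which is why all three columns are marked with an up arrow.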

ELITE encodes a concept in about 0.05 seconds, versus 6 to 50 minutes for the optimization-based methods in the table.
Using multi-layer, multi-word global mapping yields a more editable primary word and better concept fidelity than single-layer or single-word variants.
Incorporating a local mapping network improves local detail consistency with modest impact on editability.
ELITE attains the highest text alignment in the table (CLIP-T 0.255). Its image alignment (CLIP-I 0.762, DINO-I 0.652) is above Textual Inversion but slightly below DreamBooth and Custom Diffusion, and its encoding is far faster than all three.
User studies indicate strong preference for ELITE in editing alignment and overall satisfaction, while maintaining comparable image-level fidelity to competing methods.
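To put the speed finding in concrete terms, the reported encoding times can be converted to seconds and compared. This is a small arithmetic sketch using only the Time column of the table; the dictionary and function names are illustrative.

```python
# Encoding times from the results table, converted to seconds.
ENCODING_SECONDS = {
    "Textual Inversion": 50 * 60,
    "DreamBooth": 15 * 60,
    "Custom Diffusion": 6 * 60,
    "ELITE": 0.05,
}


def speedup_over(baseline, method="ELITE", times=ENCODING_SECONDS):
    """Return how many times faster `method` is than `baseline`."""
    return times[baseline] / times[method]


for name in ("Textual Inversion", "DreamBooth", "Custom Diffusion"):
    print(f"{name}: about {speedup_over(name):,.0f}x slower than ELITE")
```

On these figures ELITE's encoding step is roughly 60,000 times faster than Textual Inversion, 18,000 times faster than DreamBooth, and 7,200 times faster than Custom Diffusion. The comparison covers concept encoding only, not image generation time.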


This review was created by AI and reviewed by human editors.