QUICK REVIEW

[Paper Review] ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation

Yuxiang Wei, Yabo Zhang|arXiv (Cornell University)|Feb 27, 2023
Video Analysis and Summarization|7 citations
TL;DR

ELITE trains a learning-based encoder with global and local mapping networks to convert visual concepts into textual embeddings, enabling fast, accurate, and editable customized text-to-image generation using pre-trained diffusion models.

ABSTRACT

In addition to the unprecedented ability in imaginary creation, large text-to-image models are expected to take customized concepts in image generation. Existing works generally learn such concepts in an optimization-based manner, yet bringing excessive computation or memory burden. In this paper, we instead propose a learning-based encoder, which consists of a global and a local mapping networks for fast and accurate customized text-to-image generation. In specific, the global mapping network projects the hierarchical features of a given image into multiple new words in the textual word embedding space, i.e., one primary word for well-editable concept and other auxiliary words to exclude irrelevant disturbances (e.g., background). In the meantime, a local mapping network injects the encoded patch features into cross attention layers to provide omitted details, without sacrificing the editability of primary concepts. We compare our method with existing optimization-based approaches on a variety of user-defined concepts, and demonstrate that our method enables high-fidelity inversion and more robust editability with a significantly faster encoding process. Our code is publicly available at https://github.com/csyxwei/ELITE.

Motivation & Objective

  • Enable fast and accurate customized text-to-image generation from a small set of concept images.
  • Replace optimization-based concept learning with a learning-based encoder.
  • Leverage multi-layer CLIP features to create a robust, editable primary concept word.
  • Incorporate a local mapping network to inject detailed, location-specific information without losing editability.
  • Demonstrate superior speed and competitive fidelity/editability against existing methods.

Proposed method

  • Use a pre-trained CLIP image encoder to extract hierarchical features from the concept image.
  • Train a global mapping network to produce multiple word embeddings from the CLIP features: one primary word for the concept and auxiliary words that absorb irrelevant disturbances (e.g., background).
  • Train a local mapping network to encode foreground patch details into the textual feature space and inject them through cross-attention to preserve local details.
  • Attach the global and local embeddings to Stable Diffusion through cross-attention projections to guide generation, using only the primary word for editing.
  • Optimize with a combination of diffusion loss and L1 regularization on embeddings (L_global = L_LDM + lambda_global ||v||_1; L_local = L_LDM + lambda_local ||V^l||_1).
  • During inference, generate concepts by using the primary word w0 and optionally fuse local details for fidelity.
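
The two training objectives listed above share one form: a diffusion loss plus an L1 penalty on the predicted embeddings. The following is a minimal sketch of that arithmetic only, in plain Python. The function names, the toy embedding, and the values chosen for the diffusion loss and the weight are assumptions made for illustration; they are not taken from the paper or its code.

```python
from typing import Sequence


def l1_norm(values: Sequence[float]) -> float:
    """Sum of absolute values, i.e. the L1 norm of a flattened embedding."""
    return sum(abs(x) for x in values)


def regularized_loss(ldm_loss: float, embedding: Sequence[float], lam: float) -> float:
    """Compute L = L_LDM + lam * ||embedding||_1.

    The same form covers both objectives in the list above:
    L_global uses the global word embedding v with lambda_global,
    L_local uses the local embeddings V^l with lambda_local.
    """
    return ldm_loss + lam * l1_norm(embedding)


# Hypothetical numbers for illustration only.
toy_embedding = [0.5, -0.25, 0.0, 1.0]
loss_global = regularized_loss(ldm_loss=0.5, embedding=toy_embedding, lam=0.01)
print(loss_global)
```

The L1 term grows with the magnitude of every embedding entry, so a larger weight pushes the mapping networks toward smaller, sparser embeddings at the cost of reconstruction quality.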

Experimental results

Research questions

  • RQ1: Can an encoder learn to map visual concepts to editable textual embeddings faster than optimization-based methods?
  • RQ2: Does a multi-layer, multi-word global mapping improve editability and fidelity over single-word embeddings?
  • RQ3: Can a local mapping network inject fine-grained details without harming the ability to edit the primary concept?
  • RQ4: How does ELITE compare to existing methods in terms of speed, text alignment, and image alignment?

Key findings

Method                   CLIP-T (↑)   CLIP-I (↑)   DINO-I (↑)   Time (↓)
Textual Inversion [15]   0.183        0.663        0.462        50 min
DreamBooth [33]          0.251        0.785        0.674        15 min
Custom Diffusion [18]    0.245        0.801        0.695        6 min
Ours                     0.255        0.762        0.652        0.05 s

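
To put the Time column in perspective, the short calculation below converts each baseline to seconds and divides by ELITE's encoding time. It assumes the times read as 50 min, 15 min, 6 min, and 0.05 s; the dictionary and function names are illustrative only.

```python
# Encoding times from the table, converted to seconds.
# Assumes the Time column reads 50 min, 15 min, 6 min for the baselines.
BASELINE_SECONDS = {
    "Textual Inversion": 50 * 60,
    "DreamBooth": 15 * 60,
    "Custom Diffusion": 6 * 60,
}
ELITE_SECONDS = 0.05


def speedup(baseline_seconds: float, elite_seconds: float = ELITE_SECONDS) -> float:
    """Ratio of a baseline's encoding time to ELITE's encoding time."""
    return baseline_seconds / elite_seconds


for name, seconds in BASELINE_SECONDS.items():
    print(f"{name}: about {speedup(seconds):,.0f}x the encoding time of ELITE")
```

Under those readings the ratios come to roughly 60,000, 18,000, and 7,200 respectively. They compare concept encoding time only, not image sampling time.
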
  • ELITE achieves fast concept encoding, finishing in about 0.05 seconds, versus minutes for optimization-based methods.
  • Using multi-layer, multi-word global mapping yields a more editable primary word and better concept fidelity than single-layer or single-word variants.
  • Incorporating a local mapping network improves local detail consistency with modest impact on editability.
  • ELITE shows competitive text alignment and image alignment while delivering substantially faster encoding times compared with Textual Inversion, DreamBooth, and Custom Diffusion.
  • User studies indicate strong preference for ELITE in editing alignment and overall satisfaction, while maintaining comparable image-level fidelity to competing methods.
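
This review does not define the alignment metrics in the table. A minimal sketch follows, assuming the definitions commonly used in this line of work: CLIP-T as the cosine similarity between the CLIP embeddings of the prompt and the generated image, and CLIP-I and DINO-I as the cosine similarity between generated and reference image embeddings from CLIP and DINO. The embeddings here are toy vectors, not real model outputs, and the helper names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(
    generated: Sequence[Sequence[float]],
    references: Sequence[Sequence[float]],
) -> float:
    """Average cosine similarity over all (generated, reference) pairs."""
    scores = [cosine_similarity(g, r) for g in generated for r in references]
    return sum(scores) / len(scores)


# Toy embeddings standing in for encoder outputs (assumed, for illustration).
generated_images = [[1.0, 0.0, 1.0], [0.5, 0.5, 1.0]]
reference_images = [[1.0, 0.0, 0.5]]
print(mean_pairwise_similarity(generated_images, reference_images))
```

Under these assumed definitions a higher score means closer alignment, which matches the upward arrows in the table header.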

This review was created by AI and reviewed by human editors.