QUICK REVIEW

[Paper Review] The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering

Yiğit Ekin, Yossi Gandelsman | arXiv (Cornell University) | Mar 18, 2026

Generative Adversarial Networks and Image Synthesis | Citations: 0

One-sentence summary

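A training-free framework that achieves continuous, controllable image editing by manipulating the text-embedding space of text-conditioned generative models. An LLM-driven pipeline constructs debiased contrastive prompts, and an elastic range search finds the steering strengths that yield smooth edits.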

ABSTRACT

We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that a simple steering in the text-embedding space is sufficient to produce smooth edit control. Given a target concept (e.g., enhancing photorealism or changing facial expression), we use a large language model to automatically construct a small set of debiased contrastive prompt pairs, from which we compute a steering vector in the generator's text-encoder space. We then add this vector directly to the input prompt representation to control generation along the desired semantic axis. To obtain a continuous control, we propose an elastic range search procedure that automatically identifies an effective interval of steering magnitudes, avoiding both under-steering (no-edit) and over-steering (changing other attributes). Adding the scaled versions of the same vector within this interval yields smooth and continuous edits. Since our method modifies only textual representations, it naturally generalizes across text-conditioned modalities, including image and video generation. To quantify the steering continuity, we introduce a new evaluation metric that measures the uniformity of semantic change across edit strengths. We compare the continuous editing behavior across methods and find that, despite its simplicity and lightweight design, our approach is comparable to training-based alternatives, outperforming other training-free methods.

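Motivation and objectives

Motivate a lightweight, plug-and-play approach to fine-grained image editing that needs no retraining or additional modules.
Show that a simple linear intervention on text-encoder representations is enough to achieve continuous control.
Use an LLM to automate contrastive prompt construction and token selection so that edits stay semantically focused.
Develop an adaptive, data-driven method for identifying the range of steering strengths that yields smooth edits.
Propose a new metric that evaluates how continuous the semantic change is across edit strengths.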

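Proposed method

Compute the steering direction in text-embedding space by pooling the representations of the tokens identified by the LLM and averaging the differences across debiased contrastive prompt pairs.
Add the steering vector to the text-encoder representation of the input prompt to guide generation along the desired semantic axis (Equation 2).
Use elastic range search to automatically identify the effective interval of steering strengths, then apply scaled versions of the vector within that interval to obtain continuous edits (see the algorithm and explanation in Section 3.3).
Use an LLM to automate token selection so that only concept-related tokens are steered, and classify each edit as local, global, or stylization (Section 3.2).
Debias via style-token pooling to separate the target attribute from entangled cues (Section 3.1.3).
Introduce a new continuity metric (MID dist) that quantifies the uniformity of semantic change across edit strengths (Section 4.3).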
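
As a rough illustration of the first two steps (computing a direction from contrastive prompt pairs, then adding a scaled copy of it to the prompt representation), here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the tuple layout of `pairs`, and the choice of mean pooling are assumptions made for this example, and the random arrays only stand in for real text-encoder outputs.

```python
import numpy as np


def pool_tokens(token_emb, token_idx):
    """Mean-pool the rows of a (num_tokens, dim) array at the selected positions."""
    return np.asarray(token_emb, dtype=float)[token_idx].mean(axis=0)


def steering_vector(pairs):
    """Average the pooled (positive - negative) differences over all prompt pairs.

    Each item of `pairs` is (pos_emb, pos_idx, neg_emb, neg_idx): two token-embedding
    arrays plus the positions of the concept-related tokens in each prompt.
    This layout is an assumption of the sketch.
    """
    diffs = [pool_tokens(pe, pi) - pool_tokens(ne, ni) for pe, pi, ne, ni in pairs]
    return np.mean(diffs, axis=0)


def steer(prompt_emb, v, alpha, token_idx=None):
    """Return a copy of prompt_emb with alpha * v added to the chosen token rows.

    With token_idx=None the shift is applied to every token.
    """
    out = np.array(prompt_emb, dtype=float)  # np.array copies by default
    if token_idx is None:
        out += alpha * v
    else:
        out[token_idx] += alpha * v
    return out


# Toy example with random stand-ins for encoder outputs (8 tokens, 4 dims).
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=(8, 4)), [2, 3], rng.normal(size=(8, 4)), [2, 3]) for _ in range(5)]
v = steering_vector(pairs)
prompt = rng.normal(size=(8, 4))
edited = [steer(prompt, v, a, token_idx=[2, 3]) for a in (0.0, 0.5, 1.0)]
```

Each element of `edited` would then be passed to the generator in place of the original prompt embedding. Sweeping `alpha` over a range is what turns the single direction into a slider.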

Figure 1: Our framework. Given a user text prompt, our method enables controllable editing in text-to-image generation without retraining. (a) In the default setting, the prompt is encoded by the text encoder and used by the generative pipeline to produce an image. (b) To introduce edit control, we

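Experimental results

Research questions

RQ1: Can steering text-encoder representations alone, with no training or architectural changes, achieve continuous and interpretable image editing?
RQ2: Does an LLM-driven pipeline that automates contrastive prompt generation and token selection enable robust, semantically focused edits across diverse concepts?
RQ3: Does elastic range search provide smooth, perceptually consistent edits and avoid under- and over-editing across backbone models?
RQ4: How does text-embedding-space steering compare with training-based methods in terms of edit strength, content preservation, and slider continuity?
RQ5: Does the approach transfer to other text-conditioned generators, including video as well as image modalities?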
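
To make the notion of slider continuity in RQ4 concrete, the sketch below scores how evenly a sequence of outputs changes along the slider. This is an illustrative stand-in and not the paper's metric, whose formula this review does not give. The function names are hypothetical, and the inputs are assumed to be feature vectors (for example, image embeddings) extracted from the outputs at increasing edit strengths.

```python
import math
import statistics


def step_distances(features):
    """Euclidean distance between consecutive feature vectors along the slider."""
    return [math.dist(a, b) for a, b in zip(features, features[1:])]


def step_uniformity(features):
    """Coefficient of variation of the step sizes; 0.0 means perfectly even steps."""
    steps = step_distances(features)
    mean = statistics.fmean(steps)
    if mean == 0.0:
        return float("inf")  # the outputs never change, so there is no slider to score
    return statistics.pstdev(steps) / mean


even = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
jumpy = [[0.0, 0.0], [0.1, 0.0], [3.0, 0.0]]
even_score = step_uniformity(even)
jumpy_score = step_uniformity(jumpy)
```

A lower score means each slider increment changes the output by a similar amount, while a high score flags a slider that barely moves and then jumps.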

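Key findings

The proposed text-embedding steering framework achieves controllability competitive with training-based controllers on strong backbones.
Elastic range search automatically identifies steering magnitudes that produce perceptually smooth edits, avoiding under- and over-steering artifacts.
LLM-guided token selection and debiasing via style-token pooling yield concept-specific, localized edits and better content preservation.
Because the method operates only in text-encoder space, it also applies to other text-conditioned modalities, including video generation, and is lightweight and generalizes well.
Compared with training-free baselines, it shows stronger edit adherence and slider behavior, and on strong backbones it performs on par with training-based methods.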
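
The paper's elastic range search is not reproduced in this review. To give a feel for the idea of bounding the steering strength between a "no visible edit" point and an "other attributes start changing" point, the sketch below swaps in a simpler stand-in: bracket doubling followed by bisection. It assumes a hypothetical scalar function `change(alpha)` that measures how far the steered output has moved from the unsteered one and is non-decreasing in `alpha`. The two thresholds and all names are illustrative assumptions.

```python
def bisect_threshold(change, thr, lo, hi, iters=40):
    """Approximate the smallest alpha in [lo, hi] with change(alpha) >= thr.

    Assumes change is non-decreasing and that change(hi) >= thr.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if change(mid) >= thr:
            hi = mid
        else:
            lo = mid
    return hi


def find_steering_range(change, under_thr, over_thr, start=1.0, max_alpha=1e4):
    """Return (alpha_min, alpha_max) bracketing the useful steering strengths.

    alpha_min is where the edit first becomes noticeable (change >= under_thr);
    alpha_max is where over-steering begins (change >= over_thr).
    """
    # Grow the upper bracket until over-steering is reached or the cap is hit.
    hi = start
    while change(hi) < over_thr and hi < max_alpha:
        hi *= 2.0
    alpha_min = bisect_threshold(change, under_thr, 0.0, hi)
    alpha_max = bisect_threshold(change, over_thr, alpha_min, hi)
    return alpha_min, alpha_max


def slider_values(alpha_min, alpha_max, n):
    """n evenly spaced steering strengths covering [alpha_min, alpha_max]."""
    if n == 1:
        return [alpha_min]
    step = (alpha_max - alpha_min) / (n - 1)
    return [alpha_min + i * step for i in range(n)]


# Toy change function: the measured change grows linearly with the strength.
a_min, a_max = find_steering_range(lambda a: a / 10.0, under_thr=0.2, over_thr=0.8)
strengths = slider_values(a_min, a_max, 5)
```

In practice `change` would wrap a generation call plus a perceptual or semantic distance, so each evaluation is expensive. That is one reason to prefer a search with few evaluations over a dense sweep.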

Figure 2: Illustration of bias inheritance in steering. When the age direction is computed from a biased dataset (e.g., predominantly old men), the resulting steering vector entangles gender with age. Consequently, age manipulations not only modify apparent age but also introduce unintended gender-r

This review was generated by AI and checked by a human editor.