QUICK REVIEW

[Paper Review] Rethinking Global Text Conditioning in Diffusion Transformers

Nikita Starodubcev, Daniil Pakhomov|arXiv (Cornell University)|Feb 9, 2026

Generative Adversarial Networks and Image Synthesis|Cited by 0

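One-Line Summary

Summary: This paper shows that pooled global text embeddings have limited impact in their conventional usage, but can deliver strong gains for diffusion transformers when used as dynamic, training-free modulation guidance. It suggests the approach is useful as a training-free method for text-to-image/video generation and editing tasks.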

ABSTRACT

Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective: serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead, and can be applied to various diffusion models, bringing improvements across diverse tasks, including text-to-image/video generation and image editing.

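Motivation and Objectives

Evaluate the practical contribution of pooled CLIP-based global text conditioning versus attention-based conditioning in diffusion transformers.
Examine modulation guidance as a lightweight, training-free way to steer diffusion models toward desirable properties.
Develop a dynamic modulation strategy that improves generation quality across text-to-image, text-to-video, and image editing tasks.
Integrate a pooled embedding into fully attention-based models in a practical way to improve performance.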

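Proposed Method

Analyzes the role of the pooled CLIP embedding in several diffusion model variants (FLUX schnell, HiDream-Fast, COSMOS) through an ablation that either removes or keeps the CLIP embedding.
Introduces a modulation-space guidance formulation that augments the global conditioning y(p,t) with a weighted difference between the conditionings of a positive prompt p+ and a negative prompt p−: ŷ(p,t) = y(p,t) + w·(y(p+,t) − y(p−,t)).
Proposes dynamic modulation guidance, which varies the guidance weight across model layers and uses a skip strategy to balance aesthetics and prompt fidelity.
As a way to integrate a pooled embedding into CLIP-free models, trains a small MLP on top of the pooled embedding and distills on synthetic data.
Evaluates text-to-image, text-to-video, and instruction-based image editing tasks with human preference studies and automatic metrics.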
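
The guidance formula above is a simple element-wise update on the pooled conditioning vector, so a short sketch may help make it concrete. The code below is an illustration only, not the authors' implementation: the function names, the toy three-dimensional vectors, and the `layer_weights` schedule (which sets the weight to zero on skipped layers) are all assumptions made for this example. The review does not specify which layers are skipped or how the weight varies across layers.

```python
from typing import List, Sequence

Vector = List[float]


def guided_embedding(y: Sequence[float], y_pos: Sequence[float],
                     y_neg: Sequence[float], w: float) -> Vector:
    """Return y + w * (y_pos - y_neg), computed element-wise.

    y, y_pos and y_neg stand for y(p,t), y(p+,t) and y(p-,t) in the formula.
    """
    if not (len(y) == len(y_pos) == len(y_neg)):
        raise ValueError("all embeddings must have the same length")
    return [a + w * (p - n) for a, p, n in zip(y, y_pos, y_neg)]


def layer_weights(num_layers: int, w: float,
                  skip: Sequence[int] = ()) -> List[float]:
    """Hypothetical per-layer schedule: weight w everywhere, 0.0 on skipped layers.

    This is an assumed stand-in for the paper's dynamic/skip strategy.
    """
    skipped = set(skip)
    return [0.0 if i in skipped else w for i in range(num_layers)]


def per_layer_embeddings(y: Sequence[float], y_pos: Sequence[float],
                         y_neg: Sequence[float],
                         weights: Sequence[float]) -> List[Vector]:
    """Build one guided conditioning vector per layer from a weight schedule."""
    return [guided_embedding(y, y_pos, y_neg, w) for w in weights]


# Toy vectors standing in for pooled text conditionings (made-up values).
y = [0.5, -1.0, 2.0]
y_pos = [1.0, 0.0, 2.0]
y_neg = [0.0, 0.0, 1.0]

# Constant guidance: the same weight is used everywhere.
print(guided_embedding(y, y_pos, y_neg, w=2.0))  # [2.5, -1.0, 4.0]

# Layer-dependent guidance: layers 0 and 1 are skipped (assumed choice).
weights = layer_weights(num_layers=4, w=2.0, skip=[0, 1])
for layer, vec in enumerate(per_layer_embeddings(y, y_pos, y_neg, weights)):
    print(layer, vec)
```

With w = 0 the guided vector equals the original conditioning, so skipped layers see the unmodified embedding while the remaining layers are shifted along the direction from the negative prompt toward the positive prompt.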

Figure 1: (top) Difference between images (DreamSim) with and without CLIP as a function of prompt length. (bot) For long prompts, images without CLIP generally do not differ from the initial ones.

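Experimental Results

Research Questions

RQ1 In conventional diffusion model conditioning, does the pooled CLIP embedding significantly affect generation quality?
RQ2 Can the pooled embedding be repurposed as controllable modulation guidance that improves aesthetics, complexity, and specific edits without additional training?
RQ3 Is dynamic modulation guidance more effective than constant guidance across tasks and prompts?
RQ4 Can modulation guidance improve CLIP-free models by integrating a pooled embedding without retraining the full model?
RQ5 How does modulation guidance perform on text-to-image, text-to-video, and instruction-based image editing benchmarks?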

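Key Findings

When used in the conventional conditioning path, the pooled CLIP embedding often contributes little to performance; attention alone is often sufficient for text alignment.
Used as modulation guidance, the pooled embedding enables controllable, training-free shifts toward desirable properties and can substantially improve generation.
Dynamic modulation guidance balances aesthetics and prompt fidelity better than constant guidance and generalizes robustly across tasks.
Integrating a pooled embedding into CLIP-free models with a small MLP and distillation improves generation quality without retraining the original model.
Experiments on text-to-image/video and image editing tasks show favorable results in human evaluations and automatic metrics, including improvements in the number of detected objects and in hand-shape correction.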

Figure 2: The modulation guidance enables local (top) and global (bottom) changes and encourages its use to shift a DM toward modes with better properties.

This review was generated by AI and checked by a human editor.