QUICK REVIEW

[Paper Review] Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers

BinXu Wang, Jingxuan Fan|arXiv (Cornell University)|Jan 9, 2026

Generative Adversarial Networks and Image Synthesis | Citations: 0

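One-line summary

This paper investigates how Diffusion Transformers generate spatial relations between objects. It identifies two distinct circuits depending on the text encoder (random embeddings vs. T5) and demonstrates causal, interpretable head-level mechanisms that govern relation and object generation.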

ABSTRACT

Diffusion Transformers (DiTs) have greatly advanced text-to-image generation, but models still struggle to generate the correct spatial relations between objects as specified in the text prompt. In this study, we adopt a mechanistic interpretability approach to investigate how a DiT can generate correct spatial relations between objects. We train, from scratch, DiTs of different sizes with different text encoders to learn to generate images containing two objects whose attributes and spatial relations are specified in the text prompt. We find that, although all the models can learn this task to near-perfect accuracy, the underlying mechanisms differ drastically depending on the choice of text encoder. When using random text embeddings, we find that the spatial-relation information is passed to image tokens through a two-stage circuit, involving two cross-attention heads that separately read the spatial relation and single-object attributes in the text prompt. When using a pretrained text encoder (T5), we find that the DiT uses a different circuit that leverages information fusion in the text tokens, reading spatial-relation and single-object information together from a single text token. We further show that, although the in-domain performance is similar for the two settings, their robustness to out-of-domain perturbations differs, potentially suggesting the difficulty of generating correct relations in real-world scenarios.

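Motivation and objectives

Understand how diffusion-based text-to-image models generate spatial relations between multiple objects.
Identify the circuits and heads that translate textual relations into image layout.
Compare the mechanisms that arise with random text embeddings versus a pretrained text encoder (T5).
Evaluate how robust the learned circuits are to out-of-domain perturbations.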

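Proposed method

Train Diffusion Transformer models of different sizes from scratch on a minimal two-object spatial-relation task.
Compare three text encoders: random token embeddings (RTE), RTE without positional encoding, and a pretrained T5 encoder.
Develop Attention Synopsis, a summary of cross-attention patterns across layers, heads, and timesteps.
Perform ablations and causal manipulations to establish the roles of specific heads.
Analyze how the T5-based model integrates relation information into object tokens via embedding-space operations.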
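
The idea of summarizing cross-attention across layers, heads, and timesteps can be illustrated with a minimal sketch. This is not the paper's definition of Attention Synopsis, which this review does not spell out. It assumes the cross-attention maps are already collected in one array of shape (timesteps, layers, heads, image tokens, text tokens), and the function name `attention_synopsis`, the token groups, and the synthetic data are all hypothetical.

```python
import numpy as np

def attention_synopsis(attn, token_groups):
    """Summarize cross-attention per (layer, head).

    attn: array of shape (T, L, H, N_img, N_txt); each row sums to 1 over text tokens.
    token_groups: dict mapping a group name to a list of text-token indices.
    Returns a dict mapping each group name to an (L, H) array holding the
    attention mass on that group, averaged over timesteps and image tokens.
    """
    mean_attn = attn.mean(axis=(0, 3))  # average over timesteps and image tokens -> (L, H, N_txt)
    return {name: mean_attn[..., idx].sum(axis=-1) for name, idx in token_groups.items()}

# Synthetic attention maps (assumption: random logits, not real model activations).
rng = np.random.default_rng(0)
T, L, H, N_img, N_txt = 4, 3, 2, 16, 6
logits = rng.normal(size=(T, L, H, N_img, N_txt))
logits[:, 2, 1, :, 3] += 5.0  # plant one head (layer 2, head 1) that attends to text token 3
attn = np.exp(logits)
attn /= attn.sum(axis=-1, keepdims=True)  # softmax over text tokens

# Hypothetical grouping of the six prompt tokens.
groups = {"object_1": [0, 1], "relation": [2, 3], "object_2": [4, 5]}
synopsis = attention_synopsis(attn, groups)

# Locate the head that puts the most attention on the relation tokens.
top = np.unravel_index(synopsis["relation"].argmax(), synopsis["relation"].shape)
print("head with most relation attention (layer, head):", tuple(int(i) for i in top))
```

Averaging over image tokens discards spatial structure, so a summary like this can only flag candidate heads; the spatial pattern of a flagged head still has to be inspected separately.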

Figure 1: Schematics of the model and task. Our T2I model architecture adopted the design of PixArt [5]. There are three main components: the text encoder that processes tokenized natural language prompts into text embeddings, the VAE that processes image inputs into image tokens, and the Diffu

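Experimental results

Research questions

RQ1 Which circuits and heads implement spatial-relation generation in Diffusion Transformers?
RQ2 How does the choice of text encoder affect the internal mechanism of relation generation?
RQ3 Is the relation-generation mechanism robust to prompt perturbations?
RQ4 Can causal interventions on the identified heads show that they are necessary for a correct spatial layout?
RQ5 How do RTE-based and T5-based DiTs differ in interpretability and generalization?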

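Key findings

The RTE-based DiT places objects correctly using a two-stage circuit consisting of a spatial-relation head and an object-generation head.
The spatial-relation head (L2H8) uses a QK-like circuit to create a spatial gradient that binds the relation word to image regions.
The object-generation head (L4H3) ties object shapes to the tagged locations; ablating either head breaks the corresponding part of the generation.
Visual-Embedding (VO) injection from the relation head into the object head induces correct object placement, demonstrating a causal link between them.
The T5-based DiT encodes relation information in the embeddings of contextualized shape tokens, which enables generation from a single token but reduces separability and robustness to prompt changes.
Attention Synopsis reveals a clear, separable circuit in the RTE-based model, whereas the T5-based model relies on relation cues embedded within object tokens, giving the two models different interpretability and robustness profiles.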
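
Head ablation of the kind referred to above can be sketched on a toy multi-head cross-attention layer. The sketch assumes zero-ablation (dropping a head's contribution before the outputs are summed); the review does not say which ablation variant the authors used, and the function `cross_attention`, the weight shapes, and the random inputs are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img, txt, Wq, Wk, Wv, Wo, ablate_heads=()):
    """Toy multi-head cross-attention from image tokens (queries) to text tokens.

    img: (N, d) image tokens; txt: (M, d) text tokens.
    Wq, Wk, Wv: (H, d, d_h) per-head projections; Wo: (H, d_h, d) per-head output maps.
    ablate_heads: indices of heads whose contribution is zeroed out.
    """
    H, d, d_h = Wq.shape
    out = np.zeros_like(img)
    for h in range(H):
        if h in ablate_heads:
            continue  # zero-ablation: this head writes nothing to the output
        q, k, v = img @ Wq[h], txt @ Wk[h], txt @ Wv[h]
        a = softmax(q @ k.T / np.sqrt(d_h))  # (N, M) attention over text tokens
        out += (a @ v) @ Wo[h]               # add this head's contribution
    return out

# Random toy inputs and weights (assumption: not taken from any trained model).
rng = np.random.default_rng(0)
N, M, d, H, d_h = 5, 4, 8, 4, 2
img, txt = rng.normal(size=(N, d)), rng.normal(size=(M, d))
Wq, Wk, Wv = (rng.normal(size=(H, d, d_h)) for _ in range(3))
Wo = rng.normal(size=(H, d_h, d))

full = cross_attention(img, txt, Wq, Wk, Wv, Wo)
without_h1 = cross_attention(img, txt, Wq, Wk, Wv, Wo, ablate_heads={1})
only_h1 = cross_attention(img, txt, Wq, Wk, Wv, Wo, ablate_heads={0, 2, 3})
print("change in output from ablating head 1:", float(np.abs(full - without_h1).max()))
```

Because the per-head outputs are summed, the full output equals the ablated output plus the removed head's contribution. That additivity is what makes a single-head ablation straightforward to interpret within one layer; effects on later layers are not additive in general.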

Figure 2: Training dynamics of the T2I models (DiT-B). A) and B) Both models, trained with random token embedding and with T5, can achieve good accuracy on the task. Solid lines show the results of the model using exponential-moving-averaged (ema) weights, while dashed lines show the non-averaged weights.
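
For readers unfamiliar with the EMA weights mentioned in the caption, the standard update rule is sketched below. The decay value and the helper names `ema_update` and `run_ema` are assumptions chosen for illustration; the review does not report the settings the authors used.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

def run_ema(trajectory, decay):
    """Apply the EMA update along a sequence of weight snapshots."""
    ema = list(trajectory[0])  # initialize the average from the first snapshot
    for weights in trajectory[1:]:
        ema = ema_update(ema, weights, decay)
    return ema

# Toy trajectory: a single weight that jumps from 0.0 to 1.0 and stays there.
trajectory = [[0.0]] + [[1.0]] * 10
ema = run_ema(trajectory, decay=0.9)
print("EMA weight after 10 steps:", ema[0])
```

The averaged weight lags behind the raw weight and smooths out step-to-step noise, which is why EMA and non-averaged curves can differ during training.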

This review was generated by AI and checked by a human editor.