QUICK REVIEW

[Paper Review] Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers

BinXu Wang, Jingxuan Fan|arXiv (Cornell University)|Jan 9, 2026

Generative Adversarial Networks and Image Synthesis | Cited by: 0

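One-Sentence Summary

This paper studies how diffusion Transformers generate spatial relations between objects. It reveals two distinct circuits depending on the text encoder (random embeddings vs. T5) and demonstrates causal, interpretable head-level mechanisms that determine relation and object generation.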

ABSTRACT

Diffusion Transformers (DiTs) have greatly advanced text-to-image generation, but models still struggle to generate the correct spatial relations between objects as specified in the text prompt. In this study, we adopt a mechanistic interpretability approach to investigate how a DiT can generate correct spatial relations between objects. We train, from scratch, DiTs of different sizes with different text encoders to learn to generate images containing two objects whose attributes and spatial relations are specified in the text prompt. We find that, although all the models can learn this task to near-perfect accuracy, the underlying mechanisms differ drastically depending on the choice of text encoder. When using random text embeddings, we find that the spatial-relation information is passed to image tokens through a two-stage circuit, involving two cross-attention heads that separately read the spatial relation and single-object attributes in the text prompt. When using a pretrained text encoder (T5), we find that the DiT uses a different circuit that leverages information fusion in the text tokens, reading spatial-relation and single-object information together from a single text token. We further show that, although the in-domain performance is similar for the two settings, their robustness to out-of-domain perturbations differs, potentially suggesting the difficulty of generating correct relations in real-world scenarios.

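Motivation and Objectives

Understand how diffusion-based text-to-image models generate spatial relations among multiple objects.
Identify the circuits and heads that translate textual relations into image layouts.
Compare how the mechanisms differ between random text embeddings and a pretrained text encoder (T5).
Evaluate the robustness of the learned circuits to out-of-domain perturbations.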

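Proposed Method

Train diffusion Transformer models of different sizes from scratch on a minimal two-object spatial-relation task.
Compare three text encoders: random token embeddings (RTE), RTE without positional encoding, and a pretrained T5 encoder.
Develop Attention Synopsis, which summarizes cross-attention patterns across layers, heads, and timesteps.
Ablate and causally manipulate the identified heads to establish their roles.
Analyze how T5-based models integrate relation information into object tokens through embedding-space manipulations.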
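
The summary does not spell out how Attention Synopsis is computed, so the following is only a minimal sketch of one plausible reading: for every (timestep, layer, head), measure how much cross-attention mass the image tokens place on each group of text tokens. The function name `attention_synopsis`, the tensor layout, the token grouping, and the synthetic attention maps with a planted "relation head" are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def attention_synopsis(attn, token_groups):
    """Summarize cross-attention per (timestep, layer, head).

    attn: array of shape (timesteps, layers, heads, n_image_tokens, n_text_tokens),
          where each row over the text-token axis sums to 1.
    token_groups: dict mapping a group name to a list of text-token indices.
    Returns a dict mapping each group name to an array of shape
    (timesteps, layers, heads) holding the mean attention mass on that group.
    """
    synopsis = {}
    for name, idx in token_groups.items():
        # Sum the mass over the group's text tokens, then average over image tokens.
        synopsis[name] = attn[..., idx].sum(axis=-1).mean(axis=-1)
    return synopsis

# Synthetic attention maps (hypothetical sizes, not the paper's).
rng = np.random.default_rng(0)
T, L, H, N_IMG, N_TXT = 3, 4, 6, 16, 5
logits = rng.normal(size=(T, L, H, N_IMG, N_TXT))
logits[:, 1, 2, :, 2] += 6.0  # plant a head that attends to the relation token
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Hypothetical prompt layout: two tokens per object, one relation token.
groups = {"object1": [0, 1], "relation": [2], "object2": [3, 4]}
syn = attention_synopsis(attn, groups)

# Average over timesteps and locate the head with the most relation-token mass.
rel = syn["relation"].mean(axis=0)
layer, head = np.unravel_index(rel.argmax(), rel.shape)
print(f"strongest relation head: layer {layer}, head {head}")
```

On real models the attention maps would come from hooks on the cross-attention layers; a table like `rel` is then a natural starting point for picking candidate heads to ablate.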

Figure 1: Schematics of the model and task. Our T2I model architecture adopted the design of PixArt [5]. There are three main components: the text encoder that processes tokenized natural language prompts into text embeddings, the VAE that processes image inputs into image tokens, and the Diffu

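Experimental Results

Research Questions

RQ1: Which circuits or heads in a diffusion Transformer implement spatial relation generation?
RQ2: How does the choice of text encoder affect the internal mechanism of relation generation?
RQ3: Is the relation-generation mechanism robust to prompt perturbations?
RQ4: Do causal interventions show that the identified heads are necessary for accurate spatial layout?
RQ5: How do RTE-based and T5-based DiTs differ in interpretability and generalization?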

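Key Findings

RTE-based DiTs use a two-stage circuit in which a spatial-relation head and an object-generation head together place objects accurately.
The spatial-relation head (L2H8) creates a spatial gradient through a QK-based circuit, binding the relation word to image regions.
The object-generation head (L4H3) ties object shapes to their tagged locations; ablating these heads disrupts the corresponding generated features.
Injecting the visual embedding from the relation head into the object head is enough to induce correct object placement, demonstrating a causal link.
T5-based DiTs encode relation information in the contextualized shape tokens, which allows generation from a single token but reduces disentanglement and prompt robustness.
Attention Synopsis reveals clear, separable circuits in RTE-based models, whereas T5-based models rely on relational cues embedded in object tokens, leading to different interpretability and robustness profiles.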
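
To make the head-ablation idea concrete, here is a minimal sketch on a toy multi-head cross-attention layer with random weights. It assumes zero-ablation (dropping a head's contribution before it is added to the output); the summary does not say which ablation variant the paper uses. The function `cross_attention`, its `ablate_heads` parameter, and all sizes are hypothetical and unrelated to the actual L2H8 and L4H3 heads.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img, txt, Wq, Wk, Wv, Wo, ablate_heads=()):
    """Toy cross-attention: image tokens query text tokens.

    img: (n_img, d), txt: (n_txt, d)
    Wq, Wk, Wv: (heads, d, d_head); Wo: (heads, d_head, d)
    ablate_heads: head indices whose contribution is zeroed out.
    """
    out = np.zeros_like(img)
    for h in range(Wq.shape[0]):
        if h in ablate_heads:
            continue  # zero-ablation: this head writes nothing to the output
        q, k, v = img @ Wq[h], txt @ Wk[h], txt @ Wv[h]
        a = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_img, n_txt)
        out += (a @ v) @ Wo[h]
    return out

rng = np.random.default_rng(0)
n_img, n_txt, d, heads, d_head = 16, 5, 8, 4, 2
img = rng.normal(size=(n_img, d))
txt = rng.normal(size=(n_txt, d))
Wq, Wk, Wv = (rng.normal(size=(heads, d, d_head)) for _ in range(3))
Wo = rng.normal(size=(heads, d_head, d))

full = cross_attention(img, txt, Wq, Wk, Wv, Wo)

# Effect of each head = how far the layer output moves when that head is removed.
effects = [
    np.linalg.norm(full - cross_attention(img, txt, Wq, Wk, Wv, Wo, ablate_heads={h}))
    for h in range(heads)
]
for h, e in enumerate(effects):
    print(f"head {h}: ablation effect {e:.3f}")
```

In a trained model the quantity of interest would be a downstream metric, such as whether the generated objects still land in the correct positions, rather than the raw change in layer output used here.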

Figure 2: Training dynamics of the T2I models (DiT-B). A) and B) Both models trained with random token embedding and T5 can achieve good accuracy on the task. Solid lines show the results of models using exponential moving averaged (EMA) weights, while dashed lines show the non-averaged weights.


This review was generated by AI and reviewed by human editors.