QUICK REVIEW

[Paper Review] Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers

BinXu Wang, Jingxuan Fan | arXiv (Cornell University) | 2026. 01. 09.

Generative Adversarial Networks and Image Synthesis | Citations: 0

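One-Line Summary

This paper investigates how Diffusion Transformers generate spatial relations between objects, shows that two distinct circuits emerge depending on the text encoder (random embeddings vs. T5), and presents causal, interpretable head-level mechanisms involved in relation and object generation.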

ABSTRACT

Diffusion Transformers (DiTs) have greatly advanced text-to-image generation, but models still struggle to generate the correct spatial relations between objects as specified in the text prompt. In this study, we adopt a mechanistic interpretability approach to investigate how a DiT can generate correct spatial relations between objects. We train, from scratch, DiTs of different sizes with different text encoders to learn to generate images containing two objects whose attributes and spatial relations are specified in the text prompt. We find that, although all the models can learn this task to near-perfect accuracy, the underlying mechanisms differ drastically depending on the choice of text encoder. When using random text embeddings, we find that the spatial-relation information is passed to image tokens through a two-stage circuit, involving two cross-attention heads that separately read the spatial relation and single-object attributes in the text prompt. When using a pretrained text encoder (T5), we find that the DiT uses a different circuit that leverages information fusion in the text tokens, reading spatial-relation and single-object information together from a single text token. We further show that, although the in-domain performance is similar for the two settings, their robustness to out-of-domain perturbations differs, potentially suggesting the difficulty of generating correct relations in real-world scenarios.

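Motivation and Objectives

Understand how diffusion-based text-to-image models generate spatial relations between multiple objects.
Identify the circuits and attention heads that translate textual relations into image layout.
Compare the mechanisms that arise with random text embeddings and with a pretrained text encoder (T5).
Evaluate how robust the learned circuits are to out-of-domain perturbations.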

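Proposed Method

Train Diffusion Transformers of different sizes from scratch on a minimal task: generating two objects in a specified spatial relation.
Compare three text encoders: random token embeddings (RTE), RTE without positional encoding, and a pretrained T5 encoder.
Develop the Attention Synopsis, which summarizes cross-attention patterns across layers, heads, and time steps.
Run ablations and causal interventions on the identified heads to establish their roles.
Analyze, through manipulations in embedding space, how the T5-based model folds relation information into the object tokens.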
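The review does not spell out how the Attention Synopsis is computed. As a rough illustration only, one way to condense cross-attention over layers, heads, and time steps is to average the attention mass that each head places on groups of prompt tokens. The function name `attention_synopsis`, the nested-list layout `attn[t][l][h]`, and the toy token grouping are all assumptions made for this sketch, not the authors' implementation.

```python
def attention_synopsis(attn, token_groups):
    """Summarize cross-attention per (layer, head).

    attn[t][l][h] is a list of rows, one per image token; each row holds
    the attention weights over text tokens (assumed to sum to 1).
    token_groups maps a group name to a list of text-token indices.
    Returns {(layer, head): {group: mean attention mass}}, averaged over
    time steps and image tokens.
    """
    n_steps = len(attn)
    synopsis = {}
    for t in range(n_steps):
        for l, layer in enumerate(attn[t]):
            for h, rows in enumerate(layer):
                entry = synopsis.setdefault((l, h), {g: 0.0 for g in token_groups})
                for group, indices in token_groups.items():
                    # Mean over image tokens of the mass on this token group.
                    mass = sum(sum(row[j] for j in indices) for row in rows) / len(rows)
                    entry[group] += mass / n_steps
    return synopsis


# Toy data (hypothetical): 2 time steps, 1 layer, 2 heads,
# 2 image tokens, 3 text tokens laid out as [object, relation, object].
relation_head = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
object_head = [[0.5, 0.0, 0.5], [0.5, 0.0, 0.5]]
attn = [[[relation_head, object_head]] for _ in range(2)]
token_groups = {"object": [0, 2], "relation": [1]}

synopsis = attention_synopsis(attn, token_groups)
for key in sorted(synopsis):
    print(key, synopsis[key])
```

A table like this makes heads that attend almost exclusively to relation words stand out from heads that attend to object words, which is the kind of separation the review reports for the RTE-based models.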

Figure 1: Schematics of the model and task. Our T2I model architecture adopted the design of PixArt [5]. There are three main components: the text encoder that processes tokenized natural language prompts into text embeddings, the VAE that processes image inputs into image tokens, and the Diffu

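Experimental Results

Research Questions

RQ1: Which circuits or heads implement spatial relation generation in Diffusion Transformers?
RQ2: How does the choice of text encoder affect the internal mechanism of relation generation?
RQ3: Is the relation-generation mechanism robust to perturbations of the prompt?
RQ4: Can causal interventions demonstrate that the identified heads are necessary for correct spatial placement?
RQ5: How do RTE-based and T5-based DiTs differ in interpretability and generalization?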

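Key Findings

RTE-based DiTs place objects correctly using a two-stage circuit: a spatial-relation head followed by an object-generation head.
The spatial-relation head (L2H8) produces a spatial gradient through a QK-like circuit that binds relation words to image regions.
The object-generation head (L4H3) ties object shapes to the tagged positions; ablating this head disrupts that aspect of generation.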
Injecting the VO (value-output) contribution of the relation head into the object head is sufficient to induce correct object placement, which demonstrates a causal link.
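T5-based DiTs encode relation information in the contextualized shape tokens, so generation can read from a single token, at the cost of separability and robustness to prompt changes.
The Attention Synopsis reveals clear, separable circuits in the RTE-based models, whereas the T5-based models rely on relational signals embedded in the object tokens, giving them a different interpretability and robustness profile.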
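The head-ablation logic behind these findings can be sketched in a few lines. This is a minimal illustration, assuming each head's contribution to the residual stream has already been projected through its slice of the output matrix, so the cross-attention update is the sum of per-head contributions. The names `cross_attention_update` and `ablation_effect` and the toy numbers are hypothetical; the paper's actual causal tests would measure the effect on generated images, not on this internal update.

```python
def cross_attention_update(head_outputs, ablate=frozenset()):
    """Sum per-head contributions into one update per image token.

    head_outputs[h][i] is head h's contribution vector for image token i.
    Heads listed in `ablate` are zero-ablated (skipped).
    """
    n_tokens = len(head_outputs[0])
    dim = len(head_outputs[0][0])
    update = [[0.0] * dim for _ in range(n_tokens)]
    for h, per_token in enumerate(head_outputs):
        if h in ablate:
            continue  # zero-ablation: drop this head's contribution
        for i, vec in enumerate(per_token):
            for d, value in enumerate(vec):
                update[i][d] += value
    return update


def ablation_effect(head_outputs, head):
    """L1 distance between the full update and the update without `head`."""
    full = cross_attention_update(head_outputs)
    ablated = cross_attention_update(head_outputs, ablate={head})
    return sum(
        abs(a - b)
        for full_row, ablated_row in zip(full, ablated)
        for a, b in zip(full_row, ablated_row)
    )


# Toy data (hypothetical): 2 heads, 2 image tokens, 2-dimensional residual.
head_outputs = [
    [[1.0, 0.0], [0.0, 1.0]],  # head 0: position-specific contribution
    [[0.5, 0.5], [0.5, 0.5]],  # head 1: uniform contribution
]
full_update = cross_attention_update(head_outputs)
without_head0 = cross_attention_update(head_outputs, ablate={0})
print(full_update, without_head0, ablation_effect(head_outputs, 0))
```

Zero-ablation is only one choice; replacing a head's output with its mean over a reference set is a common alternative when a zero vector would push activations off-distribution.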

Figure 2: Training dynamics of the T2I models (DiT-B). A) and B) Both models trained with random token embedding and T5 can achieve good accuracy on the task. Solid lines show the results of the model using exponential moving averaged (ema) weights, while dashed lines show the non-averaged weights.
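For readers unfamiliar with the ema weights mentioned in the caption, the update rule can be sketched as follows. The decay value and the flat-list representation of weights are assumptions for illustration; the review does not state the decay used in the paper.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One exponential-moving-average step over a flat list of weights.

    decay=0.999 is an assumed default, not a value taken from the paper.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy usage: the averaged copy moves slowly toward the raw training weights.
ema = [0.0, 0.0]
for raw in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    ema = ema_update(ema, raw, decay=0.9)
print(ema)
```

The averaged copy lags behind the raw weights and smooths out step-to-step noise, which is why curves for the two sets of weights are typically plotted separately.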

This review was generated by AI and reviewed by a human editor.