QUICK REVIEW

[Paper Review] Shifting the Breaking Point of Flow Matching for Multi-Instance Editing

Carmine Zaccagnino, Fabio Quattrini|arXiv (Cornell University)|2026. 02. 09.

Generative Adversarial Networks and Image Synthesis|Citations: 0

One-Line Summary

The paper introduces Instance-Disentangled Attention, which enables instance-level multi-edit operations in a single pass, and validates it on natural images and a new infographic editing benchmark.

ABSTRACT

Flow matching models have recently emerged as an efficient alternative to diffusion, especially for text-guided image generation and editing, offering faster inference through continuous-time dynamics. However, existing flow-based editors predominantly support global or single-instruction edits and struggle with multi-instance scenarios, where multiple parts of a reference input must be edited independently without semantic interference. We identify this limitation as a consequence of globally conditioned velocity fields and joint attention mechanisms, which entangle concurrent edits. To address this issue, we introduce Instance-Disentangled Attention, a mechanism that partitions joint attention operations, enforcing binding between instance-specific textual instructions and spatial regions during velocity field estimation. We evaluate our approach on both natural image editing and a newly introduced benchmark of text-dense infographics with region-level editing instructions. Experimental results demonstrate that our approach promotes edit disentanglement and locality while preserving global output coherence, enabling single-pass, instance-level editing.

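Motivation and Objectives

Motivate and enable independent, region-level edits in flow-based image editing without semantic interference.
Develop Instance-Disentangled Attention to partition joint attention and bind instance prompts to spatial regions.
Show that the disentangled attention improves edit locality and global coherence in multi-instance editing.
Evaluate on natural images and on a new infographic editing benchmark with text-dense regions.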

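Proposed Method

Use conditional rectified flow matching with a globally conditioned velocity field.
Introduce Instance-Disentangled Attention (IDAttn) by partitioning the joint attention tokens into global, local, latent, and context groups.
Apply two masking regimes (the disentanglement mask M_dis and the harmonization mask M_har) across early, middle, and late layers to control inter-instance interference.
Adopt a multi-prompt independent encoding strategy that keeps instance prompts semantically separated while remaining efficient.
Optionally perform domain-specific fine-tuning with Low-Rank Adaptation on a subset of the data, using the proposed masking strategy.
Propose an Infographics Editing Benchmark built on the Crello Edit and InfoEdit datasets, which targets editing of text regions in infographics.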
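
For orientation, the flow matching setup in the first item can be written out under one common rectified-flow convention. The symbols here (x_0 for a noise sample, x_1 for the target latent, c for the conditioning, v_θ for the learned velocity field) are assumptions for illustration and are not taken from the paper, which may use a different time direction or parameterization.

```latex
\begin{aligned}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1],\\
\mathcal{L}(\theta) &= \mathbb{E}_{x_0,\,x_1,\,t}\,\bigl\| v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\|^2,\\
\frac{\mathrm{d}x_t}{\mathrm{d}t} &= v_\theta(x_t, t, c).
\end{aligned}
```

When c enters v_θ as a single global condition, every instruction can influence every spatial location. This is the entanglement that the abstract attributes to globally conditioned velocity fields and joint attention.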

Figure 1: Logic visualization of the proposed joint attention masks.
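
To make the token partition and the two masks more concrete, here is a minimal NumPy sketch. The review does not state the exact masking rules, so everything in it is a hypothetical illustration: the token order, the rule that an instance prompt only interacts with tokens of its own instance, the rule that M_dis additionally blocks attention between latents of different instances, and the `early`/`late` layer fractions are all assumptions. The only parts taken from the review are the four token groups and the placement of harmonization in early and late layers with disentanglement in the middle.

```python
import numpy as np


def build_masks(n_global, local_lens, region_ids, n_context):
    """Return (m_dis, m_har) boolean masks; True means the pair may attend.

    Assumed token order: [global prompt | instance prompts | latent | context].
    region_ids[j] is the instance that owns latent token j, or -1 for background.
    """
    region_ids = np.asarray(region_ids)
    n_local = sum(local_lens)
    n_latent = len(region_ids)
    n = n_global + n_local + n_latent + n_context

    # owner[k] = instance index for instance-bound tokens, -1 for all others.
    owner = np.full(n, -1)
    start = n_global
    for i, length in enumerate(local_lens):
        owner[start:start + length] = i
        start += length
    latent = slice(n_global + n_local, n_global + n_local + n_latent)
    owner[latent] = region_ids

    is_local = np.zeros(n, dtype=bool)
    is_local[n_global:n_global + n_local] = True
    is_latent = np.zeros(n, dtype=bool)
    is_latent[latent] = True

    same_owner = owner[:, None] == owner[None, :]

    # Harmonization (assumed rule): any pair involving an instance prompt
    # must belong to the same instance; all other pairs are allowed.
    touches_local = is_local[:, None] | is_local[None, :]
    m_har = ~(touches_local & ~same_owner)

    # Disentanglement (assumed rule): additionally block latent-latent
    # pairs whose tokens belong to two different instances.
    both_latent = is_latent[:, None] & is_latent[None, :]
    both_bound = (owner[:, None] >= 0) & (owner[None, :] >= 0)
    m_dis = m_har & ~(both_latent & both_bound & ~same_owner)
    return m_dis, m_har


def mask_for_layer(layer, n_layers, m_dis, m_har, early=0.25, late=0.75):
    """Pick M_har for early and late layers and M_dis for middle layers."""
    frac = layer / n_layers
    return m_dis if early <= frac < late else m_har


# Two instances with 2 and 3 prompt tokens, four latent tokens, one context token.
m_dis, m_har = build_masks(
    n_global=2, local_lens=[2, 3], region_ids=[0, 0, 1, -1], n_context=1
)
```

A boolean mask of this shape, with True marking allowed pairs, could for example be passed as `attn_mask` to PyTorch's `torch.nn.functional.scaled_dot_product_attention`. How the paper actually applies its masks is not described in this review.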

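Experimental Results

Research Questions

RQ1: Does instance-level isolation in flow-based editors prevent attribute leakage when multiple regions are edited simultaneously?
RQ2: Does Instance-Disentangled Attention improve edit locality, coherence, and efficiency in multi-instance editing?
RQ3: Does the multi-prompt independent encoding strategy preserve semantic separation between prompts without added cost?
RQ4: How well do these methods transfer to text-dense infographic editing compared with natural images?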

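Key Results

Instance-Disentangled Attention improves prompt adherence and background preservation and reduces inter-instance interference.
Mask placement: using harmonization in the early and late layers and disentanglement in the middle layers maintains prompt adherence while keeping distortion lower than other layer configurations.
Multi-prompt encoding preserves semantic separation and remains efficient, at an acceptable cost on some metrics, and it enables scaling to many instances.
The proposed method achieves stronger edits and less background distortion than the baseline methods on the benchmarks.
A user study and LLM-based judgments prefer the proposed method over competing FLUX-based baselines.
Fine-tuning with the masking strategy yields modest additional gains at extra cost.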

Figure 2: CER and AR w.r.t. the number of edits.


This review was generated by AI and reviewed by a human editor.