QUICK REVIEW

[Paper Review] Training-Free Text-to-Image Compositional Food Generation via Prompt Grafting

Xinyue Pan, Yuhao Chen | arXiv (Cornell University) | 2026-01-25

Generative Adversarial Networks and Image Synthesis | Citations: 0

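One-Line Summary

This paper introduces Prompt Grafting, a training-free method that prevents object entanglement in multi-food image generation. It first forms a distinguishable layout with a layout prompt and then grafts the target food prompt, improving the presence of multiple items without fine-tuning.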

ABSTRACT

Real-world meal images often contain multiple food items, making reliable compositional food image generation important for applications such as image-based dietary assessment, where multi-food data augmentation is needed, and recipe visualization. However, modern text-to-image diffusion models struggle to generate accurate multi-food images due to object entanglement, where adjacent foods (e.g., rice and soup) fuse together because many foods do not have clear boundaries. To address this challenge, we introduce Prompt Grafting (PG), a training-free framework that combines explicit spatial cues in text with implicit layout guidance during sampling. PG runs a two-stage process where a layout prompt first establishes distinct regions and the target prompt is grafted once layout formation stabilizes. The framework enables food entanglement control: users can specify which food items should remain separated or be intentionally mixed by editing the arrangement of layouts. Across two food datasets, our method significantly improves the presence of target objects and provides qualitative evidence of controllable separation.

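Motivation and Goals

Address object entanglement and missing objects in compositional food image generation.
Enable reliable multi-food generation without fine-tuning the diffusion model.
Provide a training-free framework that builds a separable layout and fills in the content using text prompts alone.
Offer controllable generation in which users decide which items stay separated and which are entangled.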

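Proposed Method

Two-stage diffusion sampling: condition first on a layout prompt to establish distinct regions, then graft in the target prompt once the layout has stabilized.
Layout interruption is a time-varying condition c(t) that switches from c_layout to c_target at the grafting point T.
To detect layout stabilization, a CLIP-based layout-text similarity is monitored to determine the dynamic grafting timestep (S_lay).
Classifier-free guidance with a negative prompt is applied so that the items do not all collapse onto a single plate during final refinement.
The method relies on SD3 with text prompts alone, without layout annotations or model fine-tuning.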
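
The switch from the layout condition to the target condition, and the negative-prompt guidance, can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: it replays a precomputed similarity trace instead of calling CLIP or SD3, and the names `has_stabilized`, `graft_schedule`, and `cfg_combine`, the `window` and `tol` parameters, and the stabilization rule itself (several consecutive small changes) are all assumptions introduced here.

```python
from typing import List, Sequence

Vector = List[float]


def has_stabilized(history: Sequence[float], window: int = 3, tol: float = 0.01) -> bool:
    """Assumed rule: stable once the last `window` similarity changes are all below `tol`."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    return all(abs(b - a) < tol for a, b in zip(recent, recent[1:]))


def graft_schedule(similarities: Sequence[float], window: int = 3, tol: float = 0.01) -> List[str]:
    """Replay a layout-text similarity trace and report the condition used at each step.

    Sampling starts on the layout condition (c_layout) and switches to the
    target condition (c_target) on the step after stabilization is detected.
    """
    history: List[float] = []
    grafted = False
    used: List[str] = []
    for score in similarities:
        used.append("target" if grafted else "layout")
        if not grafted:
            history.append(score)
            grafted = has_stabilized(history, window, tol)
    return used


def cfg_combine(eps_cond: Vector, eps_neg: Vector, scale: float) -> Vector:
    """Classifier-free guidance with a negative prompt.

    The guided prediction starts from the negative-prompt prediction and moves
    toward the conditional one by a factor of `scale`.
    """
    return [n + scale * (c - n) for c, n in zip(eps_cond, eps_neg)]


# Hypothetical similarity trace: rises quickly, then flattens out.
trace = [0.10, 0.18, 0.24, 0.245, 0.247, 0.248, 0.25, 0.25]
print(graft_schedule(trace))
# ['layout', 'layout', 'layout', 'layout', 'layout', 'layout', 'target', 'target']
```

In a real pipeline the similarity values would presumably be computed during sampling rather than supplied up front, and `cfg_combine` would operate on the model's noise predictions instead of short Python lists.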

Figure 1: Example compositional food images generated by the Stable Diffusion v3 model (SD3) and by our method, with corresponding reference images.

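Experimental Results

Research Questions

RQ1: Can object entanglement in multi-food images generated by SD3 be mitigated without additional training or layout annotations?
RQ2: Does combining an explicit layout prompt with spatial cues improve the separation and presence of multiple foods?
RQ3: Does dynamically selecting the grafting timestep based on layout stabilization outperform fixed-step grafting?
RQ4: To what extent does PG generalize to non-food domains?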

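Key Results

PG substantially reduces entanglement and improves the presence of target objects compared with SD3 and other baselines.
Full PG (layout interruption + spatial cues) achieves the best F1 scores and BLIP existence rates across datasets (VFN: F1 0.537; UEC-256: F1 0.165; BLIP-exist ≈ 99.6–99.7%).
Using a dynamic grafting timestep yields the highest F1 and BLIP-exist scores, outperforming the fixed-step variants.
Because PG enforces layout separation and thereby reduces background diversity, its FID is higher (worse) than that of some baselines (e.g., 49.0 vs. 40.5 on VFN).
SC (spatial cues) provides spatial guidance, while layout interruption prevents premature fusion; both components are needed for stable separation.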
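
This summary does not state how the F1 score is computed. One plausible reading, assumed here purely for illustration, is a set-based score that compares the foods named in the prompt with the foods a recognizer finds in the generated image. The function name `presence_f1` and the example labels are hypothetical.

```python
from typing import Set


def presence_f1(expected: Set[str], detected: Set[str]) -> float:
    """Set-based F1 between the foods requested and the foods found (assumed definition)."""
    true_positives = len(expected & detected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(detected)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Hypothetical case: three foods requested, one missing from the image.
score = presence_f1({"rice", "soup", "salad"}, {"rice", "soup"})
print(round(score, 3))  # 0.8
```

Under this reading, a missing item lowers recall, while an extra or hallucinated item lowers precision.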

Figure 2: Images generated by the Stable Diffusion v1 and Stable Diffusion v3 models using the text prompt "A photo of white rice and soup".


This review was generated by AI and reviewed by a human editor.