QUICK REVIEW

[Paper Review] DiffusionAgent: Navigating Expert Models for Agentic Image Generation

Jie Qin, Jie Wu|arXiv (Cornell University)|2024. 01. 18.

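Multimodal Machine Learning Applications | Citations: 9

One-Line Summary

DiffusionGPT uses an LLM-based system to parse diverse prompts, builds a Tree-of-Thought over domain-specific models, and selects expert diffusion models with the help of human feedback, improving image generation quality across domains.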

ABSTRACT

In the accelerating era of human-instructed visual content creation, diffusion models have demonstrated remarkable generative potential. Yet their deployment is constrained by a dual bottleneck: semantic ambiguity in diverse prompts and the narrow specialization of individual models. A single diffusion architecture struggles to maintain optimal performance across heterogeneous prompts, while conventional "parse-then-call" pipelines artificially separate semantic understanding from generative execution. To bridge this gap, we introduce DiffusionAgent, a unified, language-model-driven agent that casts the entire "prompt comprehension-expert routing-image synthesis" loop into an agentic framework. Our contributions are three-fold: (1) a tree-of-thought-powered expert navigator that performs fine-grained semantic parsing and zero-shot matching to the most suitable diffusion model via an extensible prior-knowledge tree; (2) an advantage database updated with human-in-the-loop feedback, continually aligning model-selection policy with human aesthetic and semantic preferences; and (3) a fully decoupled agent architecture that activates the optimal generative path for open-domain prompts without retraining or fine-tuning any expert. Extensive experiments show that DiffusionAgent retains high generation quality while significantly broadening prompt coverage, establishing a new performance and generality benchmark for multi-domain image synthesis. The code is available at https://github.com/DiffusionAgent/DiffusionAgent

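Motivation and Objectives

Establishes the need for a unified text-to-image system that handles diverse prompts and incorporates multiple domain-specific models.
Proposes a framework that uses an LLM as a cognitive controller to select expert diffusion models.
Introduces a Tree-of-Thought structure that organizes the models and enables efficient search and selection.
Introduces an Advantage Database built from human feedback to align model selection with human preferences.
Demonstrates training-free, plug-and-play applicability across open-source diffusion models.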

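Proposed Method

The Prompt Parse Agent extracts the core content from diverse input forms (prompt-based, instruction-based, inspiration-based, and hypothesis-based).
The Tree-of-Thought (TOT) of Models builds and maintains a hierarchical model tree from model tags, which keeps the model collection extensible.
Model search uses the TOT to perform breadth-first category matching and produce a set of candidate models.
Model selection incorporates human feedback through the Advantage Database to rank the candidates and select the top model.
The Prompt Extension Agent uses in-context learning to expand the prompt with example-based descriptions.
Generation execution applies the selected model to generate images and iteratively extends the prompt to improve quality.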
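
The search and selection steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Node` class, the tag names, the model names, and the score values are all hypothetical, and plain substring matching stands in for the LLM's judgment about which category fits the prompt. The Advantage Database is likewise reduced to an assumed dictionary of per-model human-feedback scores.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A category in a hypothetical model tree; `models` lists experts tagged with it."""
    tag: str
    children: list[Node] = field(default_factory=list)
    models: list[str] = field(default_factory=list)


def matches(tag: str, prompt: str) -> bool:
    # Assumption: substring matching replaces the LLM's category decision.
    return tag.lower() in prompt.lower()


def collect_models(node: Node) -> list[str]:
    """Gather every model stored at or below `node`."""
    found = list(node.models)
    for child in node.children:
        found.extend(collect_models(child))
    return found


def search_candidates(root: Node, prompt: str) -> list[str]:
    """Walk the tree level by level, keeping only categories that match the prompt."""
    frontier = [root]
    candidates: list[str] = []
    while frontier:
        next_frontier: list[Node] = []
        for node in frontier:
            matched = [c for c in node.children if matches(c.tag, prompt)]
            if matched:
                next_frontier.extend(matched)
            else:
                # No finer category fits, so everything under this node is a candidate.
                candidates.extend(collect_models(node))
        frontier = next_frontier
    return candidates


def select_model(candidates: list[str], feedback_scores: dict[str, float]) -> str:
    """Pick the candidate with the highest (assumed) human-feedback score."""
    return max(candidates, key=lambda name: feedback_scores.get(name, 0.0))


def build_demo_tree() -> Node:
    # Hypothetical two-level tree; real trees would be built from model tags.
    return Node("root", children=[
        Node("photo", children=[
            Node("portrait", models=["model_portrait"]),
            Node("landscape", models=["model_landscape"]),
        ]),
        Node("anime", models=["model_anime_a", "model_anime_b"]),
    ])


if __name__ == "__main__":
    tree = build_demo_tree()
    found = search_candidates(tree, "an anime girl under cherry blossoms")
    scores = {"model_anime_a": 0.4, "model_anime_b": 0.7}  # made-up feedback scores
    print(found, select_model(found, scores))
```

When no child category matches, the sketch falls back to every model under the current node, so a prompt such as "a photo of a mountain" still yields both photo experts as candidates. How the actual system handles unmatched levels is not stated in this review.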

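Experimental Results

Research Questions

RQ1: Can a unified framework lift constraints on prompts and activate the domain-expert models best suited to text-to-image generation?
RQ2: How can an LLM-guided Tree-of-Thought and human feedback improve model selection and output quality across prompts and domains?
RQ3: What gains in realism, semantics, and aesthetics do Tree-of-Thought plus human feedback (TOT+HF) and prompt extension deliver over the base diffusion model?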

Key Results

Method	Image reward	Aesthetic score
SD15	0.28	5.26
Random	0.45	5.50
DiffusionGPT w/o HF	0.56	5.62
DiffusionGPT	0.63	5.70

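DiffusionGPT outperforms the SD1.5 baseline across prompts on both image-reward and aesthetic score (DiffusionGPT: 0.63 image-reward, 5.70 aesthetic score; SD15: 0.28 image-reward, 5.26 aesthetic score).
In the user study, images generated by DiffusionGPT were consistently preferred over those from the base model.
Tree-of-Thought (TOT) and human feedback (HF) substantially improve semantic alignment and realism compared with random model selection.
Prompt extension markedly improves image aesthetics and detail.
Prompt parsing and TOT-based model search let the system handle diverse input types beyond simple prompts.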
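
The gaps between rows of the results table can be made explicit with simple arithmetic. The snippet below copies the reported scores and computes absolute differences between any two methods. The function name `gains_over` is illustrative only, and the label "w/o HF" is read here as "without human feedback".

```python
# Scores copied from the results table: (image reward, aesthetic score).
results = {
    "SD15": (0.28, 5.26),
    "Random": (0.45, 5.50),
    "DiffusionGPT w/o HF": (0.56, 5.62),  # read as: without human feedback
    "DiffusionGPT": (0.63, 5.70),
}


def gains_over(baseline: str, method: str) -> tuple[float, float]:
    """Absolute gain of `method` over `baseline` on both metrics, rounded to 2 places."""
    base_reward, base_aesthetic = results[baseline]
    reward, aesthetic = results[method]
    return round(reward - base_reward, 2), round(aesthetic - base_aesthetic, 2)


if __name__ == "__main__":
    print(gains_over("SD15", "DiffusionGPT"))                 # (0.35, 0.44)
    print(gains_over("Random", "DiffusionGPT w/o HF"))        # (0.11, 0.12)
    print(gains_over("DiffusionGPT w/o HF", "DiffusionGPT"))  # (0.07, 0.08)
```

Read this way, replacing random selection with tree-based selection accounts for +0.11 image reward, and adding human feedback contributes a further +0.07. Both steps are smaller than the overall +0.35 gap to SD15, part of which is already closed by picking a random expert model.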

This review was generated by AI and reviewed by a human editor.