QUICK REVIEW

[Paper Review] Controllable Text-to-Image Generation

Bowen Li, Xiaojuan Qi|arXiv (Cornell University)|2019. 09. 16.

Generative Adversarial Networks and Image Synthesis | Citations: 79

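One-line summary

ControlGAN introduces word-level spatial and channel-wise attention, a word-level discriminator, and a perceptual loss to enable attribute-specific image manipulation guided by natural language; it outperforms prior methods on CUB and is competitive on COCO.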

ABSTRACT

In this paper, we propose a novel controllable text-to-image generative adversarial network (ControlGAN), which can effectively synthesise high-quality images and also control parts of the image generation according to natural language descriptions. To achieve this, we introduce a word-level spatial and channel-wise attention-driven generator that can disentangle different visual attributes, and allow the model to focus on generating and manipulating subregions corresponding to the most relevant words. Also, a word-level discriminator is proposed to provide fine-grained supervisory feedback by correlating words with image regions, facilitating training an effective generator which is able to manipulate specific visual attributes without affecting the generation of other content. Furthermore, perceptual loss is adopted to reduce the randomness involved in the image generation, and to encourage the generator to manipulate specific attributes required in the modified text. Extensive experiments on benchmark datasets demonstrate that our method outperforms existing state of the art, and is able to effectively manipulate synthetic images using natural language descriptions. Code is available at https://github.com/mrlibw/ControlGAN.

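Research motivation and goals

Motivates the need for precisely controllable text-to-image generation.
Proposes a framework that can modify specific visual attributes without altering unrelated content.
Achieves (partial) disentanglement of visual attributes through word-level spatial and channel-wise attention.
Uses a perceptual loss to constrain the generated image to remain semantically consistent with the unmodified content.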

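Proposed method

Introduces a multi-stage generator with word-level spatial and channel-wise attention to disentangle attributes.
Proposes a word-level discriminator that correlates words with image subregions to provide fine-grained feedback.
Adopts a perceptual loss so that generated images stay semantically aligned with the unmodified content.
Trains with a combination of adversarial loss, text-image correlation loss, perceptual loss, and a DAMSM-based loss across multiple stages.
Evaluates quantitatively and qualitatively on CUB and COCO against StackGAN++ and AttnGAN.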
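
As a rough illustration of the channel-wise attention idea, the sketch below correlates each feature channel with every word, normalises those scores with a softmax over the words, and uses the weights to build a word-aware feature map. It is a minimal NumPy sketch, not the authors' code: the projection matrix `F`, the tensor shapes, and the function name `channel_wise_attention` are assumptions made for illustration, and the real model learns the projection inside a multi-stage generator.

```python
import numpy as np

def channel_wise_attention(v, w, F):
    """Toy channel-wise attention.

    v: (C, HW) visual features, one row per channel.
    w: (D, L)  word features, one column per word.
    F: (HW, D) hypothetical projection mapping words into the visual space.
    Returns the attended features (C, HW) and the weights alpha (C, L).
    """
    w_proj = F @ w                                # (HW, L) projected word features
    m = v @ w_proj                                # (C, L) channel-word correlation
    m = m - m.max(axis=1, keepdims=True)          # subtract row max for a stable softmax
    alpha = np.exp(m)
    alpha /= alpha.sum(axis=1, keepdims=True)     # softmax over words, per channel
    f = alpha @ w_proj.T                          # (C, HW) word-weighted features
    return f, alpha

# Illustrative sizes only (not taken from the paper).
rng = np.random.default_rng(0)
C, H, W, D, L = 8, 4, 4, 16, 5
v = rng.standard_normal((C, H * W))
w = rng.standard_normal((D, L))
F = 0.1 * rng.standard_normal((H * W, D))

f, alpha = channel_wise_attention(v, w, F)
print(f.shape, alpha.shape)
```

Each row of `alpha` sums to one, so every channel distributes its attention across the words of the sentence; a channel that responds to a particular attribute can then be driven mostly by the word that describes it.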

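Experimental results

Research questions

RQ1: Can ControlGAN disentangle specific visual attributes conditioned on text and manipulate them without changing unrelated content?
RQ2: Does channel-wise attention improve the alignment between words and image channels and thereby contribute to attribute control?
RQ3: Does the word-level discriminator provide finer-grained feedback that improves controllability and image quality?
RQ4: How does the perceptual loss affect randomness during text-guided editing and the preservation of unmodified content?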
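
To make the quantity behind RQ4 concrete, the sketch below computes a perceptual loss as the mean squared difference between feature maps of two images. The paper computes this on features from a pretrained network (VGG-16); the `toy_features` extractor here is a hypothetical stand-in (2x2 average pooling) so that the example runs without any model weights, and the function names are assumptions for illustration.

```python
import numpy as np

def toy_features(img):
    """Hypothetical stand-in for a pretrained feature extractor.

    img: (C, H, W) array with even H and W.
    Returns 2x2 average-pooled features of shape (C, H/2, W/2).
    """
    c, h, w = img.shape
    return img.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def perceptual_loss(generated, reference, phi=toy_features):
    """Squared L2 distance between feature maps, normalised by feature size."""
    fg, fr = phi(generated), phi(reference)
    return float(np.sum((fg - fr) ** 2) / fg.size)

# Identical images give zero loss; different images give a positive loss.
same = np.ones((3, 4, 4))
loss_same = perceptual_loss(same, same)
loss_diff = perceptual_loss(np.zeros((3, 4, 4)), np.ones((3, 4, 4)))
print(loss_same, loss_diff)
```

Because the loss compares features rather than raw pixels, it penalises changes to content the modified text did not ask for, which is the mechanism the review credits for reduced randomness during editing.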

Key results

Dataset	Method	IS	Top-1 Acc (%)	L2 error
CUB	StackGAN++	4.04 ± .05	45.28 ± 3.72	0.29
CUB	AttnGAN	4.36 ± .03	67.82 ± 4.43	0.26
CUB	Ours (ControlGAN)	4.58 ± .09	69.33 ± 3.23	0.18
COCO	StackGAN++	8.30 ± .10	72.83 ± 3.17	0.32
COCO	AttnGAN	25.89 ± .47	85.47 ± 3.69	0.40
COCO	Ours (ControlGAN)	24.06 ± .60	82.43 ± 2.43	0.17

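On CUB, ControlGAN achieves a higher Inception Score (4.58) and top-1 R-precision (69.33%) than StackGAN++ and AttnGAN.
On COCO, ControlGAN's Inception Score (24.06) and R-precision (82.43%) are competitive with but slightly below AttnGAN's (25.89 and 85.47%), while its reconstruction error is much lower (0.17 vs. 0.40).
ControlGAN has the lowest L2 reconstruction error on both datasets (0.18 on CUB, 0.17 on COCO), indicating better preservation of unmodified content.
Qualitative results show attribute manipulation that matches the modified text while preserving other content.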
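
The size of the reconstruction-error gap is easier to judge as a relative reduction. The short calculation below uses the L2 error values from the results table, assuming the first three rows are CUB and the last three are COCO (the row grouping is inferred from the surrounding text); the helper name `relative_reduction` is illustrative.

```python
# L2 reconstruction errors as reported in the results table.
l2_error = {
    "CUB": {"StackGAN++": 0.29, "AttnGAN": 0.26, "ControlGAN": 0.18},
    "COCO": {"StackGAN++": 0.32, "AttnGAN": 0.40, "ControlGAN": 0.17},
}

def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

reductions = {}
for dataset, errors in l2_error.items():
    for baseline in ("StackGAN++", "AttnGAN"):
        pct = relative_reduction(errors[baseline], errors["ControlGAN"])
        reductions[(dataset, baseline)] = pct
        print(f"{dataset}: {pct:.1f}% lower L2 error than {baseline}")
```

Under these numbers, ControlGAN's L2 error is about 31% lower than AttnGAN's on CUB and about 57.5% lower on COCO, so the content-preservation advantage is largest on the dataset where its Inception Score trails.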

This review was generated by AI and reviewed by a human editor.