QUICK REVIEW

[Paper Review] PasteGAN: A Semi-Parametric Method to Generate Image from Scene Graph

Yikang Li, Tao Ma|arXiv (Cornell University)|2019. 05. 05.

Multimodal Machine Learning Applications | References: 33 | Citations: 42

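One-Line Summary

PasteGAN generates images from scene graphs using external object crops as anchors, combining a Crop Refining Network and an Object-Image Fuser with a Crop Selector that retrieves compatible crops. On Visual Genome and COCO-Stuff, it achieves higher Inception Score (IS) and Diversity Score and a lower Fréchet Inception Distance (FID) than prior SOTA methods.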

ABSTRACT

Despite some exciting progress on high-quality image generation from structured(scene graphs) or free-form(sentences) descriptions, most of them only guarantee the image-level semantical consistency, i.e. the generated image matching the semantic meaning of the description. They still lack the investigations on synthesizing the images in a more controllable way, like finely manipulating the visual appearance of every object. Therefore, to generate the images with preferred objects and rich interactions, we propose a semi-parametric method, PasteGAN, for generating the image from the scene graph and the image crops, where spatial arrangements of the objects and their pair-wise relationships are defined by the scene graph and the object appearances are determined by the given object crops. To enhance the interactions of the objects in the output, we design a Crop Refining Network and an Object-Image Fuser to embed the objects as well as their relationships into one map. Multiple losses work collaboratively to guarantee the generated images highly respecting the crops and complying with the scene graphs while maintaining excellent image quality. A crop selector is also proposed to pick the most-compatible crops from our external object tank by encoding the interactions around the objects in the scene graph if the crops are not provided. Evaluated on Visual Genome and COCO-Stuff dataset, our proposed method significantly outperforms the SOTA methods on Inception Score, Diversity Score and Fréchet Inception Distance. Extensive experiments also demonstrate our method's ability to generate complex and diverse images with given objects.

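Motivation and Objectives

Enable fine-grained control over object appearance in images generated from scene graphs.
Propose a semi-parametric framework that uses external object crops to guide rendering while respecting the scene graph structure.
Enable automatic crop selection to handle cases where the user does not specify object appearances.
Achieve high-quality image synthesis by fusing object appearances and relationships into a single latent canvas.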

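Proposed Method

Encode the scene graph with a graph convolution network to obtain a context vector for each object.
Introduce a Crop Refining Network, which includes the Object² Refiner, to encode object crops and fuse them with relationship-aware features.
Use an attention-based Object-Image Fuser to inject object features into a latent scene canvas, guided by the scene graph relationships.
Add a Crop Selector that retrieves the most compatible object crops from an external object tank according to the scene graph context.
Train with adversarial losses from two discriminators (image-level and object-level), together with reconstruction, perceptual, and box regression losses, to align crops, objects, and scene layout.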
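
The Crop Selector step can be pictured as a nearest-neighbor lookup over an object tank. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the `Crop` record, the `select_crop` helper, the toy three-dimensional context vectors, and the choice of Euclidean distance as the compatibility measure are all hypothetical. In the paper the context vectors would come from the scene graph encoder.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Crop:
    """Hypothetical object-tank entry (names are illustrative only)."""
    crop_id: str
    category: str
    context: Sequence[float]  # assumed: embedding of the object in its source scene graph


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_crop(category: str, query_context: Sequence[float],
                tank: List[Crop]) -> Optional[Crop]:
    """Return the same-category crop whose context vector is closest to the query.

    Assumption: compatibility is approximated by distance between context
    vectors; the measure used in the paper may differ.
    """
    candidates = [c for c in tank if c.category == category]
    if not candidates:
        return None
    return min(candidates, key=lambda c: l2_distance(c.context, query_context))


# Toy object tank with made-up embeddings.
tank = [
    Crop("sheep_01", "sheep", [0.9, 0.1, 0.0]),
    Crop("sheep_02", "sheep", [0.1, 0.8, 0.3]),
    Crop("tree_01", "tree", [0.1, 0.8, 0.3]),
]

best = select_crop("sheep", [0.2, 0.7, 0.2], tank)
print(best.crop_id)  # sheep_02
```

Restricting candidates to the query category first keeps the lookup from returning a crop of the wrong object class, even when its context vector happens to be closer.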

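Experimental Results

Research Questions

RQ1: Can a semi-parametric generation framework that uses external object crops produce images that faithfully reflect the scene graph while allowing fine-grained control over object appearance?
RQ2: Does integrating the Crop Refining Network and the Object-Image Fuser improve object-level appearance consistency and scene-level layout compared with previous scene-graph-to-image methods?
RQ3: Does a Crop Selector that retrieves context-appropriate crops improve image quality and diversity without manual appearance specification?
RQ4: How do the proposed components affect standard image synthesis metrics (IS, Diversity, FID) on Visual Genome and COCO-Stuff?
RQ5: Can the method generate complex and diverse scenes from given objects while maintaining high visual fidelity?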

Key Results

Method	IS (COCO)	IS (VG)	Diversity (COCO)	Diversity (VG)	FID (COCO)	FID (VG)
Real Images	16.3±0.4	13.9±0.5	-	-	-	-
sg2im	6.7±0.1	5.5±0.1	0.02±0.01	0.12±0.06	82.75	71.27
PasteGAN	9.1±0.2	6.9±0.2	0.27±0.11	0.24±0.09	50.94	58.53
sg2im (GT)	7.3±0.1	6.3±0.2	0.02±0.01	0.15±0.12	63.28	52.96
PasteGAN (GT)	10.2±0.2	8.2±0.2	0.32±0.09	0.29±0.08	38.29	35.25
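
For reference when reading the table, the standard definitions of IS and FID are given below. These are the commonly used formulations, not text from the review itself. The symbols are assumptions introduced here: p(y|x) is the Inception classifier's label distribution for a generated image x, p(y) is its marginal over generated images, and (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariance of Inception features for real and generated images.

```latex
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g}\big[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big)

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\Big( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \Big)
```

Higher IS is better, while lower FID is better, which is why PasteGAN's rows show larger IS values and smaller FID values than the sg2im rows.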

PasteGAN achieves a higher Inception Score than sg2im on both COCO-Stuff and Visual Genome (COCO: 9.1±0.2 vs. 6.7±0.1; VG: 6.9±0.2 vs. 5.5±0.1).
PasteGAN achieves a lower Fréchet Inception Distance than sg2im (COCO: 50.94 vs. 82.75; VG: 58.53 vs. 71.27).
The GT variant of PasteGAN further improves IS and lowers FID relative to the non-GT variant (COCO: IS 10.2±0.2, FID 38.29; VG: IS 8.2±0.2, FID 35.25).
Compared with layout2im, PasteGAN is competitive or better on IS and Diversity and shows a favorable FID, and crop-guided generation improves object-level fidelity.
Ablation studies show that removing the Crop Selector, the Object² Refiner, or the Object-Image Fuser lowers IS and raises FID, highlighting the contribution of each component.
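
To put the FID gaps in proportion, the relative reduction over sg2im can be computed directly from the table values. This is a small arithmetic sketch: the `relative_reduction` helper and the dictionary layout are illustrative, and only the FID numbers come from the table above.

```python
# FID values copied from the results table: (sg2im, PasteGAN) per setting.
fid = {
    "COCO": (82.75, 50.94),
    "VG": (71.27, 58.53),
    "COCO (GT)": (63.28, 38.29),
    "VG (GT)": (52.96, 35.25),
}


def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0


reductions = {name: round(relative_reduction(b, o), 1) for name, (b, o) in fid.items()}

for name, pct in reductions.items():
    print(f"{name}: FID reduced by {pct}%")
# COCO: FID reduced by 38.4%
# VG: FID reduced by 17.9%
# COCO (GT): FID reduced by 39.5%
# VG (GT): FID reduced by 33.4%
```

On these numbers, the gain on Visual Genome is noticeably smaller than on COCO-Stuff in the non-GT setting, while the GT setting narrows that difference.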

This review was generated by AI and reviewed by a human editor.