QUICK REVIEW

[Paper Review] Image Generation from Scene Graphs

Justin Johnson, Agrim Gupta|arXiv (Cornell University)|Apr 4, 2018

Multimodal Machine Learning Applications|47 citations

TL;DR

The paper presents an end-to-end model that generates realistic images from scene graphs: it processes the input graph with graph convolutions, predicts a scene layout of bounding boxes and segmentation masks, and renders that layout with a cascaded refinement network. The whole model is trained adversarially against a pair of discriminators.

ABSTRACT

To truly understand the visual world our models should be able not only to recognize images but also generate them. To this end, there has been exciting recent progress on generating images from natural language descriptions. These methods give stunning results on limited domains such as descriptions of birds or flowers, but struggle to faithfully reproduce complex sentences with many objects and relationships. To overcome this limitation we propose a method for generating images from scene graphs, enabling explicitly reasoning about objects and their relationships. Our model uses graph convolution to process input graphs, computes a scene layout by predicting bounding boxes and segmentation masks for objects, and converts the layout to an image with a cascaded refinement network. The network is trained adversarially against a pair of discriminators to ensure realistic outputs. We validate our approach on Visual Genome and COCO-Stuff, where qualitative results, ablations, and user studies demonstrate our method's ability to generate complex images with multiple objects.

Motivation & Objective

Motivate generating images from structured scene graphs to handle complex scenes with multiple objects and relationships.
Develop a graph-based embedding for scene graphs to inform object placement and layout.
Bridge from symbolic scene graphs to pixel-level images through a scene layout and CRN-based rendering.

Proposed method

Process scene graphs with a graph convolution network to produce object embeddings.
Predict per-object bounding boxes and segmentation masks to form a scene layout.
Render the scene layout into an image using a cascaded refinement network (CRN).
Train the entire pipeline adversarially with two discriminators: an image-space discriminator and an object-focused discriminator.
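
The first three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (graph_conv_step, compose_layout), the single weight matrix standing in for the learned edge network, and the nearest-neighbour mask warping are all simplifications chosen for readability. The rendering network and the two discriminators are omitted.

```python
import numpy as np


def graph_conv_step(obj_vecs, pred_vecs, edges, weight):
    """One simplified graph-convolution step (hypothetical helper).

    obj_vecs:  (O, D) array, one vector per object node.
    pred_vecs: (E, D) array, one vector per relationship edge.
    edges:     list of E (subject_index, object_index) pairs.
    weight:    (3D, 3D) matrix standing in for a learned edge network.
    """
    obj_vecs = np.asarray(obj_vecs, dtype=float)
    pred_vecs = np.asarray(pred_vecs, dtype=float)
    num_objs, dim = obj_vecs.shape
    new_pred = np.zeros_like(pred_vecs)
    pooled = np.zeros_like(obj_vecs)
    counts = np.zeros(num_objs)
    for e, (s, o) in enumerate(edges):
        # Concatenate (subject, predicate, object) and apply linear + ReLU.
        triple = np.concatenate([obj_vecs[s], pred_vecs[e], obj_vecs[o]])
        out = np.maximum(triple @ weight, 0.0)
        new_pred[e] = out[dim:2 * dim]
        # Each edge proposes a candidate vector for its subject and object.
        pooled[s] += out[:dim]
        pooled[o] += out[2 * dim:]
        counts[s] += 1
        counts[o] += 1
    # Average the candidates per object; objects with no edges stay at zero.
    new_obj = pooled / np.maximum(counts, 1.0)[:, None]
    return new_obj, new_pred


def compose_layout(obj_vecs, boxes, masks, height, width):
    """Paint each object embedding into its box, weighted by its mask.

    boxes: per-object (x0, y0, x1, y1) in normalized [0, 1] coordinates.
    masks: (O, M, M) array of soft masks in [0, 1].
    Returns a (D, height, width) layout; overlapping objects are summed.
    """
    obj_vecs = np.asarray(obj_vecs, dtype=float)
    masks = np.asarray(masks, dtype=float)
    num_objs, dim = obj_vecs.shape
    mask_size = masks.shape[1]
    layout = np.zeros((dim, height, width))
    ys = (np.arange(height) + 0.5) / height  # pixel centres, normalized
    xs = (np.arange(width) + 0.5) / width
    for i in range(num_objs):
        x0, y0, x1, y1 = boxes[i]
        rows = np.nonzero((ys >= y0) & (ys < y1))[0]
        cols = np.nonzero((xs >= x0) & (xs < x1))[0]
        if rows.size == 0 or cols.size == 0:
            continue  # box covers no pixel centre at this resolution
        # Nearest-neighbour lookup into the mask (assumption: a smoother
        # interpolation would be needed for gradient-based training).
        my = np.clip(((ys[rows] - y0) / (y1 - y0) * mask_size).astype(int),
                     0, mask_size - 1)
        mx = np.clip(((xs[cols] - x0) / (x1 - x0) * mask_size).astype(int),
                     0, mask_size - 1)
        warped = masks[i][np.ix_(my, mx)]  # (h, w)
        layout[:, rows[:, None], cols[None, :]] += (
            obj_vecs[i][:, None, None] * warped[None]
        )
    return layout


# Toy usage: 3 objects, 2 relationships, random stand-in weights.
rng = np.random.default_rng(0)
objs, preds = rng.random((3, 8)), rng.random((2, 8))
objs, preds = graph_conv_step(objs, preds, [(0, 1), (1, 2)],
                              rng.normal(size=(24, 24)))
toy_boxes = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 0.75, 0.75), (0.5, 0.5, 1.0, 1.0)]
toy_layout = compose_layout(objs, toy_boxes, np.ones((3, 16, 16)), 64, 64)
```

The resulting layout is a feature map with one channel per embedding dimension, which is the kind of input a rendering network such as a CRN would consume. In this sketch the boxes and masks are supplied by hand; in the pipeline described above they would be predicted from the object embeddings.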

Experimental results

Research questions

RQ1: Can scene graphs be leveraged to generate images with correct objects and relationships in complex scenes?
RQ2: Does graph-based reasoning improve object localization and layout prediction for image synthesis?
RQ3: How do layout-based approaches compare to text-to-image methods in producing recognizable objects and semantic fidelity?
RQ4: What is the contribution of adversarial training and object-level discrimination to image realism?

Key findings

The proposed method generates complex images that respect input scene graphs on Visual Genome and COCO-Stuff.
Graph convolution and relationship modeling improve object localization and layout variety compared to ablations.
Adversarial training with D_img and D_obj yields more realistic images and recognizable objects than pixel-only training.
User studies show higher semantic interpretability and object recall for the scene-graph-based method than StackGAN on corresponding COCO-derived tasks.
Predicted layouts (bounding boxes and masks) can be effective even when ground-truth layouts are unavailable at test time.
Ground-truth layouts further improve image quality, indicating a bottleneck in layout prediction rather than rendering.

This review was created by AI and reviewed by human editors.