QUICK REVIEW

[Paper Review] Image Generation from Scene Graphs

Justin Johnson, Agrim Gupta|arXiv (Cornell University)|Apr 4, 2018

Multimodal Machine Learning Applications|Cited by 47

One-Sentence Summary

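This paper proposes an end-to-end model that generates realistic images from scene graphs: it processes the input graph with graph convolution, predicts a scene layout made of bounding boxes and segmentation masks, renders that layout with a cascaded refinement network, and is trained adversarially against a pair of discriminators.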

ABSTRACT

To truly understand the visual world our models should be able not only to recognize images but also generate them. To this end, there has been exciting recent progress on generating images from natural language descriptions. These methods give stunning results on limited domains such as descriptions of birds or flowers, but struggle to faithfully reproduce complex sentences with many objects and relationships. To overcome this limitation we propose a method for generating images from scene graphs, enabling explicitly reasoning about objects and their relationships. Our model uses graph convolution to process input graphs, computes a scene layout by predicting bounding boxes and segmentation masks for objects, and converts the layout to an image with a cascaded refinement network. The network is trained adversarially against a pair of discriminators to ensure realistic outputs. We validate our approach on Visual Genome and COCO-Stuff, where qualitative results, ablations, and user studies demonstrate our method's ability to generate complex images with multiple objects.

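Motivation and Objectives

Generate images from structured scene graphs in order to handle complex scenes with many objects and relationships.
Develop a graph-based embedding of the scene graph to guide object placement and layout.
Bridge the symbolic scene graph to pixel-level images through a scene layout and rendering with a cascaded refinement network (CRN).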
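
To make the graph-based embedding concrete, here is a minimal sketch of one message-passing step over a scene graph. The summary does not spell out the layer, so the form below is an assumption: every (subject, predicate, object) edge proposes a candidate vector for its subject and its object, the candidates are averaged per object, and the predicate vector is updated from the same triple. The names `graph_conv_step` and `mean3` are hypothetical, and `mean3` is a fixed stand-in for what would be learned networks in a real model.

```python
def mean3(a, b, c):
    """Stand-in for a learned network: element-wise mean of three vectors."""
    return [(x + y + z) / 3 for x, y, z in zip(a, b, c)]


def graph_conv_step(obj_vecs, pred_vecs, edges, g_s=mean3, g_p=mean3, g_o=mean3):
    """One message-passing step over (subject_index, object_index) edges.

    obj_vecs  : one vector per object node
    pred_vecs : one vector per edge (the predicate)
    edges     : list of (subject_index, object_index), aligned with pred_vecs
    """
    candidates = [[] for _ in obj_vecs]
    new_pred_vecs = []
    for (s, o), v_p in zip(edges, pred_vecs):
        triple = (obj_vecs[s], v_p, obj_vecs[o])
        candidates[s].append(g_s(*triple))   # candidate for the subject
        new_pred_vecs.append(g_p(*triple))   # updated predicate vector
        candidates[o].append(g_o(*triple))   # candidate for the object

    new_obj_vecs = []
    for old, cands in zip(obj_vecs, candidates):
        if not cands:
            new_obj_vecs.append(list(old))   # isolated node: keep its vector
            continue
        dim = len(cands[0])
        # Average all candidates so the result does not depend on edge order.
        new_obj_vecs.append([sum(c[k] for c in cands) / len(cands) for k in range(dim)])
    return new_obj_vecs, new_pred_vecs


# Toy scene graph: "sheep standing on grass", "sky above grass".
objects = ["sheep", "grass", "sky"]
obj_vecs = [[3.0, 0.0], [0.0, 3.0], [6.0, 6.0]]
edges = [(0, 1), (2, 1)]
pred_vecs = [[0.0, 0.0], [0.0, 0.0]]

new_objs, new_preds = graph_conv_step(obj_vecs, pred_vecs, edges)
for name, vec in zip(objects, new_objs):
    print(name, vec)
```

In this toy graph "grass" takes part in two relationships, so its new vector is the average of two candidates, while "sheep" and "sky" each receive one. Stacking several such steps lets information travel along longer paths in the graph.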

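Proposed Method

Process the scene graph with a graph convolution network to produce object embeddings.
Predict a bounding box and a segmentation mask for each object to form the scene layout.
Render the scene layout into an image with a cascaded refinement network (CRN).
Train adversarially against two discriminators: an image-level discriminator (D_img) and an object-level discriminator (D_obj).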
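
The layout step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: boxes are given as normalized `(x0, y0, x1, y1)` coordinates, each object's embedding is weighted by its mask inside its box, and overlapping objects are summed. A nearest-neighbour lookup stands in for whatever interpolation the real model uses, and `compose_layout` is a hypothetical name. The rendering network that would consume the layout is not shown.

```python
def compose_layout(embeddings, boxes, masks, height, width):
    """Build a height x width x dim layout from per-object predictions.

    embeddings : one vector per object
    boxes      : one (x0, y0, x1, y1) per object, normalized to [0, 1]
    masks      : one 2D grid of soft mask values in [0, 1] per object
    """
    dim = len(embeddings[0])
    layout = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for emb, (x0, y0, x1, y1), mask in zip(embeddings, boxes, masks):
        m_rows, m_cols = len(mask), len(mask[0])
        for row in range(height):
            for col in range(width):
                # Centre of this layout cell in normalized coordinates.
                x = (col + 0.5) / width
                y = (row + 0.5) / height
                if not (x0 <= x < x1 and y0 <= y < y1):
                    continue  # cell lies outside this object's box
                # Nearest-neighbour lookup of the mask value for this cell.
                mx = min(int((x - x0) / (x1 - x0) * m_cols), m_cols - 1)
                my = min(int((y - y0) / (y1 - y0) * m_rows), m_rows - 1)
                weight = mask[my][mx]
                cell = layout[row][col]
                for k in range(dim):
                    cell[k] += weight * emb[k]  # overlapping objects add up
    return layout


# Two toy objects on a 4 x 4 layout: "grass" fills the bottom half,
# "sheep" sits in the centre with its top-left mask corner switched off.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
boxes = [(0.0, 0.5, 1.0, 1.0), (0.25, 0.25, 0.75, 0.75)]
masks = [[[1.0]], [[0.0, 1.0], [1.0, 1.0]]]

layout = compose_layout(embeddings, boxes, masks, height=4, width=4)
for row in layout:
    print(row)
```

Summing weighted embeddings keeps the layout a dense grid of feature vectors, so a convolutional renderer can read both which object occupies a cell and how strongly.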

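Experimental Results

Research Questions

RQ1: Can scene graphs be used to generate images of complex scenes that contain the correct objects and relationships?
RQ2: Does graph-based reasoning improve object localization and layout prediction for image synthesis?
RQ3: How does the layout-based approach compare with text-to-image methods in producing recognizable objects and semantic fidelity?
RQ4: What do adversarial training and object-level discrimination contribute?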

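Key Findings

The proposed method generates complex images on Visual Genome and COCO-Stuff that respect the input scene graph.
Graph convolution and relationship modeling improve object localization and layout diversity relative to the ablated variants.
Adversarial training with D_img and D_obj yields more realistic images and more recognizable objects than training with pixel-level losses alone.
In user studies on the corresponding COCO-derived task, the scene-graph-based method scores higher than StackGAN on semantic interpretability and object recall.
Predicted layouts (bounding boxes and masks) remain effective even when no ground-truth layout is available at test time.
Ground-truth layouts further improve image quality, which suggests that the bottleneck lies in layout prediction rather than in the rendering stage.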
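
The comparison between adversarial and pixel-only training rests on an objective that mixes a reconstruction term with two adversarial terms. The sketch below uses the standard GAN cross-entropy losses and is only an illustration: the summary gives no loss formulas or weights, so the non-saturating generator term, the L1 pixel term, and the default weights `w_img` and `w_obj` are all assumptions, and every function name is hypothetical.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for a single probability, clamped for stability."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def mean_bce(probs, target):
    return sum(bce(p, target) for p in probs) / len(probs)


def discriminator_loss(real_probs, fake_probs):
    """Standard GAN discriminator loss: real samples -> 1, generated -> 0."""
    return mean_bce(real_probs, 1.0) + mean_bce(fake_probs, 0.0)


def generator_loss(pixel_l1, img_fake_probs, obj_fake_probs, w_img=0.01, w_obj=0.01):
    """Pixel term plus two adversarial terms (weights are assumed values).

    img_fake_probs : image-level discriminator outputs on generated images
    obj_fake_probs : object-level discriminator outputs on generated object crops
    """
    adv_img = mean_bce(img_fake_probs, 1.0)  # generator wants "real" verdicts
    adv_obj = mean_bce(obj_fake_probs, 1.0)
    return pixel_l1 + w_img * adv_img + w_obj * adv_obj


# Toy discriminator outputs (probabilities that the input is real).
d_img_loss = discriminator_loss(real_probs=[0.9, 0.8], fake_probs=[0.2, 0.1])
d_obj_loss = discriminator_loss(real_probs=[0.7], fake_probs=[0.4])
g_loss = generator_loss(pixel_l1=0.25, img_fake_probs=[0.2, 0.1], obj_fake_probs=[0.4])
print(d_img_loss, d_obj_loss, g_loss)
```

Setting both weights to zero reduces the generator objective to the pixel term alone, which corresponds to the pixel-only baseline in the comparison.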


This review was generated by AI and reviewed by human editors.