QUICK REVIEW

[Paper Review] Context-Aware Synthesis and Placement of Object Instances

Dong-Hoon Lee, Sifei Liu | arXiv (Cornell University) | Dec 6, 2018

Multimodal Machine Learning Applications | Cited by 65

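One-Sentence Summary

This paper proposes an end-to-end conditional GAN framework with two connected modules (where and what) that synthesizes and places object instance masks in a semantic label map, modeling the location and shape distributions conditioned on scene context.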

ABSTRACT

Learning to insert an object instance into an image in a semantically coherent manner is a challenging and interesting problem. Solving it requires (a) determining a location to place an object in the scene and (b) determining its appearance at the location. Such an object insertion model can potentially facilitate numerous image editing and scene parsing applications. In this paper, we propose an end-to-end trainable neural network for the task of inserting an object instance mask of a specified class into the semantic label map of an image. Our network consists of two generative modules where one determines where the inserted object mask should be (i.e., location and scale) and the other determines what the object mask shape (and pose) should look like. The two modules are connected together via a spatial transformation network and jointly trained. We devise a learning procedure that leverages both supervised and unsupervised data and show our model can insert an object at diverse locations with various appearances. We conduct extensive experimental validations with comparisons to strong baselines to verify the effectiveness of the proposed network.

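Motivation and Goals

Motivate and address the problem of inserting new object instances into an image in a way that respects scene semantics.
Learn the joint distribution of where to place an object and what shape/pose it should take, conditioned on the input semantic map.
Achieve diverse and realistic object insertion for image editing, AR/VR, and data augmentation.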

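Proposed Method

Two generative modules: the where module predicts location and scale as an affine transformation applied through a Spatial Transformer Network (STN); the what module generates the object mask conditioned on that location.
Each module is a conditional GAN with a shared encoder and includes a latent variable drawn from a unit Gaussian to model variation.
The where module is trained with three losses: an adversarial layout loss, an input reconstruction loss, and a supervised affine transformation loss, which together mitigate mode collapse.
The what module mirrors this design with layout and shape discriminators, and adds a supervised path to encourage diverse, realistic shapes.
The end-to-end differentiable connection through the STN enables joint optimization of the two modules and consistent placement of the generated shapes.
During training, both a supervised path and an unsupervised path are used to mitigate mode collapse; at inference, only the unsupervised path is used.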
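The placement step described above can be illustrated with a minimal sketch. It assumes the affine transformation is restricted to scale (sx, sy) and translation (tx, ty) in normalized [-1, 1] canvas coordinates, and it uses nearest-neighbour sampling where a real STN would use differentiable bilinear sampling. The function name `place_mask` and the parameter layout are hypothetical and are not taken from the paper.

```python
def place_mask(mask, canvas_h, canvas_w, sx, sy, tx, ty):
    """Paste a binary object mask onto an empty canvas.

    Assumed forward map (object -> canvas), in normalized [-1, 1] coordinates:
        x_canvas = sx * x_obj + tx,   y_canvas = sy * y_obj + ty
    Each canvas pixel is mapped back to object coordinates (inverse warp)
    and filled with the nearest mask value.
    """
    mask_h, mask_w = len(mask), len(mask[0])
    canvas = [[0] * canvas_w for _ in range(canvas_h)]
    for i in range(canvas_h):
        y_c = (2 * i + 1) / canvas_h - 1          # pixel centre in [-1, 1]
        y_o = (y_c - ty) / sy                     # back to object coordinates
        if not -1 <= y_o < 1:
            continue                              # row falls outside the object box
        for j in range(canvas_w):
            x_c = (2 * j + 1) / canvas_w - 1
            x_o = (x_c - tx) / sx
            if not -1 <= x_o < 1:
                continue
            r = int((y_o + 1) / 2 * mask_h)       # nearest mask row
            c = int((x_o + 1) / 2 * mask_w)       # nearest mask column
            canvas[i][j] = mask[r][c]
    return canvas


# Usage: a 2x2 all-ones mask, shrunk to half size and shifted to the
# bottom-right quadrant of a 4x4 canvas.
placed = place_mask([[1, 1], [1, 1]], 4, 4, sx=0.5, sy=0.5, tx=0.5, ty=0.5)
```

In the actual model the four parameters would come from the where generator and the mask from the what generator; here both are fixed by hand so that only the geometry is shown.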

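Experimental Results

Research Questions

RQ1: How can object instances be plausibly inserted into a semantic label map while respecting scene context and geometry?
RQ2: Can the model learn the joint distribution of object placement and generated shape, conditioned on the input scene?
RQ3: Does decomposing the problem into separate where and what modules improve training stability and output diversity?
RQ4: How well do the generated insertions align with real-world scenes, as measured by downstream recognition/detection?
RQ5: How do the key discriminators and the supervision affect diversity and realism?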

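Key Findings

The proposed architecture learns context-aware distributions over plausible object locations (where) and shapes (what).
A two-module, end-to-end trainable design, differentiably coupled through the STN, makes joint optimization of placement and appearance possible.
In the ablation study, removing a discriminator or the supervision leads to mode collapse or to reduced diversity and placement accuracy.
In the human evaluation, workers judged the synthesized insertions to be real in 43% of cases, which the review takes as a sign of strong realism.
Quantitative recall on the Cityscapes test set shows the full model reaching 0.79, while the ablated variants score lower, indicating that every component contributes.
The method raises the likelihood that inserted instances are detected by a state-of-the-art detector only when all discriminators are used (the full model's recall in the table is 0.79).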
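For readers unfamiliar with the metric, the sketch below shows one common way to compute detection recall over inserted instances: an inserted box counts as recalled if some detection overlaps it with IoU at or above a threshold. The review does not state the paper's matching protocol, so the 0.5 threshold, the box format (x0, y0, x1, y1), and the function names are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall(inserted, detections, thr=0.5):
    """Fraction of inserted boxes matched by at least one detection.

    `thr` is an assumed IoU threshold, not a value from the paper.
    """
    if not inserted:
        return 0.0
    hits = sum(any(iou(g, d) >= thr for d in detections) for g in inserted)
    return hits / len(inserted)


# Usage: two inserted instances, one of which the detector finds.
inserted_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
detected_boxes = [(1, 1, 10, 10)]
score = recall(inserted_boxes, detected_boxes)
```

Under this definition, a recall of 0.79 would mean that 79 of every 100 inserted instances were matched by a detection.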


This review was generated by AI and checked by a human editor.