QUICK REVIEW

[Paper Review] Generative Semantic Manipulation with Contrasting GAN

Xiaodan Liang, Hao Zhang|arXiv (Cornell University)|Aug 1, 2017

Generative Adversarial Networks and Image Synthesis | Cited by 27

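One-Sentence Summary

This paper proposes a contrasting generative adversarial network (contrast-GAN) for generative semantic manipulation, which performs large semantic changes such as cat→dog or motorcycle→bicycle while preserving the object's shape and viewpoint. By optimizing relative feature distances, so that the generated image lies closer to real images of the target category than the input image does, the model achieves better visual fidelity and semantic accuracy than earlier GAN models on the ImageNet and MSCOCO datasets.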

ABSTRACT

Generative Adversarial Networks (GANs) have recently achieved significant improvement on paired/unpaired image-to-image translation, such as photo$\rightarrow$sketch and artist painting style transfer. However, existing models can only be capable of transferring the low-level information (e.g. color or texture changes), but fail to edit high-level semantic meanings (e.g., geometric structure or content) of objects. On the other hand, while some researches can synthesize compelling real-world images given a class label or caption, they cannot condition on arbitrary shapes or structures, which largely limits their application scenarios and interpretive capability of model results. In this work, we focus on a more challenging semantic manipulation task, which aims to modify the semantic meaning of an object while preserving its own characteristics (e.g. viewpoints and shapes), such as cow$\rightarrow$sheep, motor$\rightarrow$bicycle, cat$\rightarrow$dog. To tackle such large semantic changes, we introduce a contrasting GAN (contrast-GAN) with a novel adversarial contrasting objective. Instead of directly making the synthesized samples close to target data as previous GANs did, our adversarial contrasting objective optimizes over the distance comparisons between samples, that is, enforcing the manipulated data be semantically closer to the real data with target category than the input data. Equipped with the new contrasting objective, a novel mask-conditional contrast-GAN architecture is proposed to enable disentangle image background with object semantic changes. Experiments on several semantic manipulation tasks on ImageNet and MSCOCO dataset show considerable performance gain by our contrast-GAN over other conditional GANs. Quantitative results further demonstrate the superiority of our model on generating manipulated results with high visual fidelity and reasonable object semantics.

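Motivation and Objectives

Enable controllable image generation that performs large semantic changes (e.g., cat→dog) while preserving the object's geometric structure and viewpoint.
Overcome the limitation of existing GANs, which can only modify low-level attributes such as color or texture.
Develop a conditional image synthesis method that conditions on complex, structured inputs such as object masks rather than on fixed labels or captions.
Disentangle the background from object-level semantic manipulation through a mask-conditional architecture.
Improve the interpretability and controllability of unsupervised image generation by learning comparative feature distances.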

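Proposed Method

Proposes a novel adversarial contrasting objective that compares, in feature space, the distances among the generated sample, the input image, and real images of the target category.
Uses a single conditional generator shared across all semantic categories and conditioned on the target category and an object mask, so that manipulation stays localized.
Employs multiple semantic-aware discriminators to ensure that the generated image is semantically closer to real target-category images than the input image is.
Introduces a global discriminator $D_I$ that checks the realism of the whole image and complements the contrasting loss.
Combines the contrasting loss with an LSGAN loss and a cycle-consistency loss to stabilize training and improve visual quality.
Adopts a mask-conditional architecture to isolate and manipulate a specific object instance while preserving the background and spatial context.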
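The points above can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the softmax-style form of the relative-distance loss, the mask compositing rule, the loss weights, and all function names (`contrasting_loss`, `lsgan_generator_loss`, `cycle_loss`, `composite_with_mask`, `total_generator_loss`) are hypothetical choices made here for clarity. Feature vectors are assumed to come from some feature extractor that is not shown.

```python
import numpy as np

def contrasting_loss(f_gen, f_in, f_target_real):
    """Relative-distance loss (assumed softmax form, not taken from the paper).

    Small when the generated feature is closer to the mean feature of real
    target-category images than the input feature is.
    """
    center = f_target_real.mean(axis=0)      # mean feature of real target images
    d_gen = np.linalg.norm(f_gen - center)   # distance: manipulated image to target
    d_in = np.linalg.norm(f_in - center)     # distance: input image to target
    # -log(exp(-d_gen) / (exp(-d_gen) + exp(-d_in))) == log(1 + exp(d_gen - d_in))
    return float(np.logaddexp(0.0, d_gen - d_in))

def lsgan_generator_loss(d_scores):
    """Least-squares GAN generator term: push discriminator scores toward 1."""
    return float(np.mean((np.asarray(d_scores) - 1.0) ** 2))

def cycle_loss(x, x_reconstructed):
    """L1 cycle-consistency between an image and its round-trip reconstruction."""
    return float(np.mean(np.abs(x - x_reconstructed)))

def composite_with_mask(x, g_out, mask):
    """Keep the background from x and take the object region from g_out.

    Assumed compositing rule; mask is 1 inside the object and 0 elsewhere.
    """
    return mask * g_out + (1.0 - mask) * x

def total_generator_loss(l_contrast, l_gan, l_cyc, lam_gan=1.0, lam_cyc=10.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return l_contrast + lam_gan * l_gan + lam_cyc * l_cyc
```

In this sketch the contrasting term only compares two distances, so a generated feature that matches the input feature gives `log(2)`, and the value falls as the generated feature moves toward the target-category center. The compositing function shows one simple way a mask can confine changes to the object while leaving background pixels untouched.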

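Experimental Results

Research Questions

RQ1: Can a GAN-based model perform large semantic manipulation (e.g., cat→dog) while preserving the object's shape and viewpoint?
RQ2: Does an adversarial contrasting objective, which compares relative feature distances, improve semantic manipulation over the standard GAN objective?
RQ3: Does a single shared generator conditioned on category labels and object masks outperform a separate generator per category?
RQ4: How does the method compare with CycleGAN and other GANs on unpaired image-to-image translation and semantic manipulation tasks?
RQ5: To what extent does mask conditioning disentangle the background from object-level semantic changes?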

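Key Findings

The contrast-GAN model considerably outperforms baseline GAN models, including CycleGAN and other conditional GANs, on semantic manipulation tasks such as cat↔dog and bicycle↔motorcycle on the MSCOCO dataset.
In the AMT perceptual realism study, the model scores considerably higher than the baselines, especially on tasks that require large semantic changes.
Ablation experiments indicate that the contrasting loss, the LSGAN loss, and the cycle-consistency loss are all needed for the best performance.
The shared generator with mask conditioning performs on par with or better than per-category generators, while reducing model size and improving robustness.
Adding the auxiliary global discriminator $D_I$ further improves visual fidelity, indicating that it plays a complementary role in judging realism.
Qualitative results show that the model makes minimal but effective changes to object structure and texture, converting the semantic identity while preserving the original viewpoint and the object's interaction with the background.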

This review was generated by AI and reviewed by human editors.