QUICK REVIEW

[Paper Review] You Only Need Adversarial Supervision for Semantic Image Synthesis

Vadim Sushko, Edgar Schönfeld|arXiv (Cornell University)|Dec 8, 2020

Generative Adversarial Networks and Image Synthesis|References: 55|Cited by: 70

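One-Sentence Summary

OASIS introduces a segmentation-based discriminator and a generator driven by 3D noise, achieving high-quality, diverse semantic image synthesis with adversarial supervision alone and making the perceptual loss unnecessary.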

ABSTRACT

Despite their recent successes, GAN models for semantic image synthesis still suffer from poor image quality when trained with only adversarial supervision. Historically, additionally employing the VGG-based perceptual loss has helped to overcome this issue, significantly improving the synthesis quality, but at the same time limiting the progress of GAN models for semantic image synthesis. In this work, we propose a novel, simplified GAN model, which needs only adversarial supervision to achieve high quality results. We re-design the discriminator as a semantic segmentation network, directly using the given semantic label maps as the ground truth for training. By providing stronger supervision to the discriminator as well as to the generator through spatially- and semantically-aware discriminator feedback, we are able to synthesize images of higher fidelity with better alignment to their input label maps, making the use of the perceptual loss superfluous. Moreover, we enable high-quality multi-modal image synthesis through global and local sampling of a 3D noise tensor injected into the generator, which allows complete or partial image change. We show that images synthesized by our model are more diverse and follow the color and texture distributions of real images more closely. We achieve an average improvement of $6$ FID and $5$ mIoU points over the state of the art across different datasets using only adversarial supervision.

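Research Motivation and Goals

Remove the perceptual loss from semantic image synthesis by strengthening the discriminator's feedback.
Design a discriminator that uses semantic label maps for pixel-level, class-aware supervision.
Develop a generator that produces multi-modal outputs by injecting 3D noise at every layer.
Demonstrate higher image quality and diversity than existing methods on ADE20K, Cityscapes, and COCO-stuff.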

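Proposed Method

Redesign the discriminator as a semantic segmentation network with N+1 classes (N real semantic classes plus 1 fake class), using inverse-frequency class weights for class balancing.
Introduce LabelMix regularization, which pushes the discriminator to focus on semantic and structural differences by enforcing consistent predictions under label-guided mixing of real and fake images.
Train the generator with only the adversarial loss from the segmentation-based discriminator, in place of the perceptual loss.
Enable multi-modal synthesis by injecting a 3D noise tensor into the generator at every layer, allowing global and local (per-segment or per-pixel) variation.
Reduce complexity by removing one initial residual block, yielding a lighter generator (72M parameters).
Compare label map encoding strategies and ablate architectural choices to validate the discriminator's effectiveness without the perceptual loss.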
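
The first item in the list above, an (N+1)-class segmentation discriminator with inverse-frequency class weights, can be illustrated with a small sketch. This is not the authors' implementation: plain Python lists stand in for tensors, the function names (`inverse_frequency_weights`, `log_softmax`, `discriminator_loss`) are hypothetical, the weights are computed from a single label map rather than a batch, and dividing by the pixel count is an assumed normalization.

```python
import math
from collections import Counter


def inverse_frequency_weights(label_map, num_classes):
    """Weight class c by total_pixels / pixels_of_class_c (0.0 if c is absent)."""
    counts = Counter(c for row in label_map for c in row)
    total = sum(counts.values())
    return [total / counts[c] if counts[c] else 0.0 for c in range(num_classes)]


def log_softmax(logits):
    """Numerically stable log-softmax over one pixel's class logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def discriminator_loss(real_logits, fake_logits, label_map, num_classes):
    """Per-pixel (N+1)-class cross-entropy (illustrative sketch).

    real_logits, fake_logits: H x W x (N+1) nested lists of discriminator logits.
    label_map: H x W integers in [0, N); index N is the extra "fake" class.
    Real pixels should be classified as their semantic class (weighted),
    and every pixel of a generated image as the fake class.
    """
    fake_class = num_classes
    weights = inverse_frequency_weights(label_map, num_classes)
    height, width = len(label_map), len(label_map[0])
    real_term = 0.0
    fake_term = 0.0
    for i in range(height):
        for j in range(width):
            c = label_map[i][j]
            real_term -= weights[c] * log_softmax(real_logits[i][j])[c]
            fake_term -= log_softmax(fake_logits[i][j])[fake_class]
    # Assumed normalization: average over pixels.
    return (real_term + fake_term) / (height * width)
```

Under the same assumptions, the generator's adversarial loss would reuse the weighted real-class term on generated images, so the generator is rewarded when each synthesized pixel is classified as its intended semantic class rather than as fake.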

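Experimental Results

Research Questions

RQ1: Does a segmentation-based discriminator give the generator stronger, semantically aware feedback than a conventional multi-scale discriminator?
RQ2: When the discriminator provides semantically aware supervision, is the perceptual (VGG) loss still necessary for high-quality semantic image synthesis?
RQ3: Does multi-modal synthesis based on 3D noise increase diversity without sacrificing image fidelity?
RQ4: How does LabelMix regularization affect the realism and semantic alignment of generated images?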
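
To make RQ4 more concrete, the following is a rough sketch of label-guided mixing and a consistency penalty. It rests on assumptions not stated in this summary: the mask is built by assigning every class in the label map a random bit, and the penalty is the mean squared difference between the discriminator's output on the mixed image and the same mix applied to its outputs on the real and fake images. The names `labelmix_mask`, `labelmix`, and `consistency_penalty` are hypothetical, and nested lists stand in for tensors.

```python
import random


def labelmix_mask(label_map, rng):
    """Give each class present in the label map a random bit (0 or 1).

    Every pixel inherits its class's bit, so mask boundaries follow
    semantic boundaries instead of arbitrary rectangles.
    """
    classes = sorted({c for row in label_map for c in row})
    bit = {c: rng.randint(0, 1) for c in classes}
    return [[bit[c] for c in row] for row in label_map]


def labelmix(real, fake, mask):
    """Take the real element where the mask is 1 and the fake element where it is 0."""
    return [
        [r if m else f for r, f, m in zip(real_row, fake_row, mask_row)]
        for real_row, fake_row, mask_row in zip(real, fake, mask)
    ]


def consistency_penalty(d_mixed, d_real, d_fake, mask):
    """Mean squared difference between D(mixed) and the mix of D(real) and D(fake).

    Each argument except mask is an H x W grid of per-pixel logit lists.
    """
    target = labelmix(d_real, d_fake, mask)
    squared_error, count = 0.0, 0
    for mixed_row, target_row in zip(d_mixed, target):
        for mixed_px, target_px in zip(mixed_row, target_row):
            for a, b in zip(mixed_px, target_px):
                squared_error += (a - b) ** 2
                count += 1
    return squared_error / count
```

A discriminator that judges each pixel by its own content gives a penalty near zero, while one that reacts to the artificial cut-and-paste boundaries is penalized.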

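Key Findings

OASIS achieves state-of-the-art results on ADE20K, Cityscapes, and COCO-stuff, improving on the prior state of the art by an average of 6 FID points and 5 mIoU points while using only adversarial supervision.
The segmentation-based discriminator (N+1 classes) provides per-pixel, semantically aware feedback, removing the need for perceptual losses such as the VGG loss.
Multi-modal synthesis driven by 3D noise enables global and local appearance changes, increasing diversity while preserving semantic alignment.
Ablations show that replacing the SPADE+ discriminator with the OASIS discriminator yields large FID and mIoU gains, and that adding 3D noise increases diversity; the perceptual loss may have a subtle effect on diversity and, without a stronger discriminator, sometimes lowers FID.
LabelMix regularization improves pixel-level realism by pushing the discriminator to respect semantic boundaries and content differences.
Compared with SPADE+, OASIS achieves better FID and mIoU without the perceptual loss, demonstrating the strength of discriminator-driven supervision.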
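
The global and local appearance changes mentioned in the findings above can be pictured with a toy sketch of 3D noise sampling. It is illustrative only: nested lists stand in for an H x W x C noise tensor, the names `sample_global_noise`, `resample_region`, and `concat_with_labels` are hypothetical, and concatenating the noise with a one-hot label map as the generator's conditioning input is stated here as an assumption about how the noise is injected.

```python
import random


def sample_global_noise(height, width, channels, rng):
    """Draw one Gaussian vector and replicate it at every pixel (H x W x C)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(channels)]
    return [[list(z) for _ in range(width)] for _ in range(height)]


def resample_region(noise, label_map, target_class, rng):
    """Return a copy of the noise in which only pixels of target_class change.

    All pixels of the chosen class receive the same freshly drawn vector,
    so only that segment of the synthesized image would be altered.
    """
    channels = len(noise[0][0])
    z_new = [rng.gauss(0.0, 1.0) for _ in range(channels)]
    return [
        [list(z_new) if c == target_class else list(px)
         for px, c in zip(noise_row, label_row)]
        for noise_row, label_row in zip(noise, label_map)
    ]


def concat_with_labels(noise, label_map, num_classes):
    """Append a one-hot label vector to each pixel's noise vector (assumed input format)."""
    return [
        [px + [1.0 if k == c else 0.0 for k in range(num_classes)]
         for px, c in zip(noise_row, label_row)]
        for noise_row, label_row in zip(noise, label_map)
    ]
```

Calling `sample_global_noise` again changes the whole image, whereas `resample_region` changes one semantic segment and leaves the rest of the noise tensor untouched.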


This review was generated by AI and reviewed by human editors.