QUICK REVIEW

[Paper Review] Image Transformer

Niki Parmar, Ashish Vaswani|arXiv (Cornell University)|Feb 15, 2018

Generative Adversarial Networks and Image Synthesis|References: 11|Cited by: 200

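One-Sentence Summary

This paper proposes the Image Transformer, a generative model based on self-attention. By restricting attention to local neighborhoods, it adapts the self-attention of the original Transformer architecture to image generation, which makes larger images tractable to model while keeping a large receptive field. The model achieves state-of-the-art results on ImageNet, with a negative log-likelihood of 3.77 against the previous best of 3.83.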

ABSTRACT

Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. Recent work has shown that self-attention is an effective way of modeling textual sequences. In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood. By restricting the self-attention mechanism to attend to local neighborhoods we significantly increase the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks. While conceptually simple, our generative models significantly outperform the current state of the art in image generation on ImageNet, improving the best published negative log-likelihood on ImageNet from 3.83 to 3.77. We also present results on image super-resolution with a large magnification ratio, applying an encoder-decoder configuration of our architecture. In a human evaluation study, we find that images generated by our super-resolution model fool human observers three times more often than the previous state of the art.

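Motivation and Objectives

Extend the Transformer architecture, originally designed for sequence data, to image generation with a tractable likelihood.
Make self-attention computationally feasible on images by restricting attention to local spatial neighborhoods, since full self-attention over all pixels is too expensive.
Improve the quality and scalability of image generation by reducing computation while keeping a large effective receptive field.
Demonstrate state-of-the-art performance on image generation and super-resolution with the proposed architecture.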
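
The cost argument above can be made concrete with a little arithmetic. The sketch below counts query-key score pairs per attention layer, first for full self-attention and then for a fixed-size local memory. The function name `attention_pair_count`, the image sizes, the flattening of each color channel into its own position, and the memory size of 256 are all assumptions chosen for illustration, not figures taken from this review.

```python
from typing import Optional


def attention_pair_count(seq_len: int, memory_len: Optional[int] = None) -> int:
    """Count query-key score pairs in one attention layer.

    memory_len=None means full self-attention: every position scores
    against every position. Otherwise each query scores against at most
    memory_len positions (a local neighborhood).
    """
    if memory_len is None:
        memory_len = seq_len
    return seq_len * min(memory_len, seq_len)


# Assumed flattening: one sequence position per color channel per pixel.
small = 32 * 32 * 3   # 3,072 positions
large = 64 * 64 * 3   # 12,288 positions
memory = 256          # hypothetical local memory size

full_small = attention_pair_count(small)
full_large = attention_pair_count(large)
local_small = attention_pair_count(small, memory)
local_large = attention_pair_count(large, memory)

print(f"full : {full_small:,} -> {full_large:,}")
print(f"local: {local_small:,} -> {local_large:,}")
```

Under these assumptions, doubling the image side multiplies the full-attention count by 16 but the local count by only 4, which is the scaling gap that motivates local attention.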

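Proposed Method

The model uses a standard Transformer decoder with multi-head self-attention, but restricts each head's attention span to a local spatial neighborhood of the image.
Although self-attention is quadratic in sequence length, local attention keeps computation efficient and lets the model scale to larger images.
The model is trained autoregressively, predicting pixels in sequence; under this factorization the likelihood is tractable.
For super-resolution, an encoder-decoder configuration is used: the encoder processes the low-resolution image and the decoder generates the high-resolution output.
Each layer keeps a receptive field significantly larger than that of a typical convolutional neural network, which strengthens the learned representations.
Training uses a standard cross-entropy loss, combined with label smoothing and a learning-rate schedule.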
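
The local, autoregressive attention described above can be sketched in a few lines. This is a minimal single-head illustration over a flattened 1D sequence with a sliding window, and it assumes precomputed query, key, and value vectors. It is not the paper's exact scheme; the function name `local_causal_attention` and the `window` parameter are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def local_causal_attention(
    queries: List[Vector],
    keys: List[Vector],
    values: List[Vector],
    window: int,
) -> List[Vector]:
    """Scaled dot-product attention where position i attends only to
    positions max(0, i - window + 1) .. i (local and causal)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    value_dim = len(values[0])
    outputs: List[Vector] = []
    for i, query in enumerate(queries):
        span = range(max(0, i - window + 1), i + 1)
        scale = math.sqrt(len(query))
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / scale for j in span
        ]
        # Softmax over the local window, shifted by the max for stability.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, span)) / total
            for d in range(value_dim)
        ])
    return outputs


# Zero queries give uniform weights, so each output is a window average.
zero_queries = [[0.0, 0.0] for _ in range(4)]
demo_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
demo_values = [[1.0], [3.0], [5.0], [7.0]]
demo_out = local_causal_attention(zero_queries, demo_keys, demo_values, window=2)
print(demo_out)
```

Each position does work proportional to `window` rather than to the full sequence length, and no output depends on a later position. In an autoregressive decoder the inputs are usually shifted by one step so that the prediction for a pixel never sees that pixel; the sketch leaves that shift out.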

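Experimental Results

Research Questions

RQ1: Can the Transformer architecture be adapted effectively to image generation with a tractable likelihood?
RQ2: How does restricting attention to local neighborhoods affect performance and scalability on large images?
RQ3: Can the Image Transformer outperform existing convolutional and autoregressive models on ImageNet image generation?
RQ4: Does the model generalize well to other image-to-image tasks such as super-resolution?
RQ5: How does the model compare with prior work in human evaluation of super-resolution quality?

Key Findings

On ImageNet, the Image Transformer reaches a negative log-likelihood of 3.77, improving on the previous best of 3.83.
On generation quality, the model clearly outperforms prior methods in both likelihood and human evaluation.
In super-resolution with a large magnification ratio, the model's outputs fool human observers three times as often as the previous state of the art.
Local attention makes it practical to train on larger images, which was difficult with full self-attention.
Each layer keeps a large effective receptive field, which strengthens the model's ability to capture long-range dependencies.
Human evaluation confirms that the super-resolved images are more realistic and harder to tell apart from real images than those of prior models.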
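
To put the likelihood figures in context, the sketch below shows how a per-dimension negative log-likelihood follows from an autoregressive factorization, where the log-likelihood of an image is the sum of the log-probabilities of each value given the preceding ones. It assumes the reported numbers are measured in bits per dimension, which this review does not state. The function name `bits_per_dim` and the example inputs are hypothetical.

```python
import math
from typing import Sequence


def bits_per_dim(cond_probs: Sequence[float]) -> float:
    """Average negative log2-probability over all predicted dimensions.

    cond_probs[i] is the probability the model assigned to the true
    value at position i, given all earlier positions.
    """
    if not cond_probs:
        raise ValueError("need at least one probability")
    return -sum(math.log2(p) for p in cond_probs) / len(cond_probs)


# A model that guesses uniformly over 256 intensity levels.
uniform_baseline = bits_per_dim([1.0 / 256.0] * 12)
print(uniform_baseline)
```

Lower is better on this scale: a uniform guess over 256 intensity levels costs 8 bits per dimension, so under the stated assumption both 3.83 and 3.77 sit well below that baseline.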

This review was generated by AI and reviewed by human editors.