QUICK REVIEW

[Paper Review] TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up

Yifan Jiang, Shiyu Chang|arXiv (Cornell University)|Feb 14, 2021

Generative Adversarial Networks and Image Synthesis|References: 92|Citations: 263

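One-Sentence Summary

TransGAN builds a GAN from pure transformers with no convolutions at all. It features a memory-friendly generator, a multi-scale discriminator, and grid self-attention, and it achieves competitive results along with scalable high-resolution generation.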

ABSTRACT

The recent explosive interest on transformers has suggested their potential to become powerful "universal" models for computer vision tasks, such as classification, detection, and segmentation. While those attempts mainly study the discriminative models, we explore transformers on some more notoriously difficult vision tasks, e.g., generative adversarial networks (GANs). Our goal is to conduct the first pilot study in building a GAN completely free of convolutions, using only pure transformer-based architectures. Our vanilla GAN architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator that progressively increases feature resolution, and correspondingly a multi-scale discriminator to capture simultaneously semantic contexts and low-level textures. On top of them, we introduce the new module of grid self-attention for alleviating the memory bottleneck further, in order to scale up TransGAN to high-resolution generation. We also develop a unique training recipe including a series of techniques that can mitigate the training instability issues of TransGAN, such as data augmentation, modified normalization, and relative position encoding. Our best architecture achieves highly competitive performance compared to current state-of-the-art GANs using convolutional backbones. Specifically, TransGAN sets new state-of-the-art inception score of 10.43 and FID of 18.28 on STL-10, outperforming StyleGAN-V2. When it comes to higher-resolution (e.g. 256 x 256) generation tasks, such as on CelebA-HQ and LSUN-Church, TransGAN continues to produce diverse visual examples with high fidelity and impressive texture details. In addition, we dive deep into the transformer-based generation models to understand how their behaviors differ from convolutional ones, by visualizing training dynamics. The code is available at https://github.com/VITA-Group/TransGAN.

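Motivation and Objectives

Explore whether a convolution-free GAN built only on pure transformer architectures can be used for image generation.
Design a memory-efficient generator and a multi-scale discriminator suited to transformer-based GANs.
Develop techniques that stabilize training and improve fidelity (grid self-attention, data augmentation, modified normalization, relative position encoding).
Evaluate TransGAN on small- and large-scale datasets and compare it with CNN-based GANs to validate its performance and scalability.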

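Proposed Method

Use a memory-friendly, multi-stage transformer-based generator that progressively increases feature-map resolution.
Implement a multi-scale discriminator that processes patches of different sizes to capture both global context and local texture.
Introduce grid self-attention to reduce the memory load at high resolution while preserving global coherence.
Stabilize training with a recipe of strong data augmentation, modified normalization (per-token scaling), and relative position encoding.
Scale to high-resolution generation (e.g., 256×256), report high-quality visual results, and conduct ablation studies.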
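The grid self-attention idea listed above can be illustrated with a minimal NumPy sketch. It assumes the feature map is split into non-overlapping square grids and that ordinary scaled dot-product attention runs inside each grid independently. The function names (`grid_partition`, `grid_merge`, `grid_self_attention`), the single-head layout, and the plain projection matrices are illustrative assumptions, not the authors' code, and the sketch leaves out the relative position encoding and modified normalization from the training recipe.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grid_partition(x, grid):
    """Split an (H, W, C) feature map into non-overlapping grids.

    Returns an array of shape (num_grids, grid * grid, C).
    """
    H, W, C = x.shape
    assert H % grid == 0 and W % grid == 0, "grid must divide H and W"
    x = x.reshape(H // grid, grid, W // grid, grid, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/g, W/g, g, g, C)
    return x.reshape(-1, grid * grid, C)

def grid_merge(tokens, H, W, grid):
    """Inverse of grid_partition: (num_grids, grid * grid, D) -> (H, W, D)."""
    D = tokens.shape[-1]
    x = tokens.reshape(H // grid, W // grid, grid, grid, D)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/g, g, W/g, g, D)
    return x.reshape(H, W, D)

def grid_self_attention(x, Wq, Wk, Wv, grid):
    """Single-head self-attention computed separately inside each grid."""
    H, W, _ = x.shape
    t = grid_partition(x, grid)                   # (G, N, C)
    q, k, v = t @ Wq, t @ Wk, t @ Wv              # (G, N, D) each
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (G, N, N)
    out = softmax(scores, axis=-1) @ v            # (G, N, D)
    return grid_merge(out, H, W, grid)

# Usage on a toy 8x8 feature map with 4 channels and 4x4 grids.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
y = grid_self_attention(x, Wq, Wk, Wv, grid=4)
```

In this sketch each token attends only to tokens in its own grid, so the score tensor has shape (G, N, N) rather than (G·N, G·N). Setting `grid` equal to the full feature-map side recovers ordinary full self-attention.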

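Experimental Results

Research Questions

RQ1: Can a GAN be built effectively from pure transformer components alone, with no convolutional layers?
RQ2: Which design choices and training strategies yield stable, high-fidelity image generation in a transformer-based GAN?
RQ3: How do memory-efficient attention mechanisms (e.g., grid self-attention) affect quality and scalability at high resolution?
RQ4: How do data augmentation and relative position encoding affect TransGAN's training stability and performance?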

Key Findings

Method	CIFAR-10 IS ↑	CIFAR-10 FID ↓	STL-10 IS ↑	STL-10 FID ↓	CelebA FID ↓
TransGAN	9.02 ± 0.12	9.26	10.43 ± 0.16	18.28	5.28

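With strong data augmentation, TransGAN achieved competitive quantitative results on CIFAR-10, STL-10, and CelebA compared with state-of-the-art CNN-based GANs.
On CIFAR-10, TransGAN achieved an Inception Score of 9.02 and an FID of 9.26.
On STL-10, it achieved an Inception Score of 10.43 and an FID of 18.28.
On CelebA (128×128), it reached an FID of 5.28, close to the best reported results.
TransGAN scales to high-resolution generation (e.g., 256×256), producing diverse, high-fidelity outputs on 256×256 CelebA-HQ and LSUN Church.
Ablation studies show that grid self-attention and the proposed training recipe (data augmentation, modified normalization, relative position encoding) substantially improve performance.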
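To see why restricting attention to local grids matters at 256×256, the size of the attention score matrix can be counted directly. The sketch below is back-of-the-envelope arithmetic, not a measurement from the paper. It assumes one token per pixel position, counts entries per head and per layer, and uses a grid size of 16 chosen only for illustration.

```python
def attention_entries(height, width, grid=None):
    """Count entries in the attention score matrix (per head, per layer).

    With grid=None every token attends to every other token.
    With a grid size, attention is computed inside each non-overlapping
    grid of grid x grid tokens.
    """
    if grid is None:
        n = height * width
        return n * n
    assert height % grid == 0 and width % grid == 0, "grid must divide H and W"
    num_grids = (height // grid) * (width // grid)
    tokens_per_grid = grid * grid
    return num_grids * tokens_per_grid ** 2

full = attention_entries(256, 256)            # 65536 ** 2 = 4294967296
local = attention_entries(256, 256, grid=16)  # 256 * 256 ** 2 = 16777216
reduction = full // local                     # 256
```

Under these assumptions the reduction factor equals (H × W) / (grid × grid), so smaller grids save more memory at the cost of a narrower receptive field per attention layer.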

This review was generated by AI and checked by a human editor.