QUICK REVIEW

[Paper Review] Improving Diffusion-Based Image Synthesis with Context Prediction

L. Yang, Jingwei Liu | arXiv (Cornell University) | Jan 4, 2024

Generative Adversarial Networks and Image Synthesis | Cited by 8

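ONE-LINE SUMMARY

Introduces ConPreDiff, a context-prediction framework for diffusion models. During training, each pixel/token is reinforced to predict its neighborhood context through a context decoder, which improves image generation across unconditional generation, text-to-image generation, and inpainting without adding inference cost.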

ABSTRACT

Diffusion models are a new class of generative models, and have dramatically promoted image generation with unprecedented quality and diversity. Existing diffusion models mainly try to reconstruct input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes. However, such point-based reconstruction may fail to make each predicted pixel/feature fully preserve its neighborhood context, impairing diffusion-based image synthesis. As a powerful source of automatic supervisory signal, context has been well studied for learning representations. Inspired by this, we for the first time propose ConPreDiff to improve diffusion-based image synthesis with context prediction. We explicitly reinforce each point to predict its neighborhood context (i.e., multi-stride features/tokens/pixels) with a context decoder at the end of diffusion denoising blocks in training stage, and remove the decoder for inference. In this way, each point can better reconstruct itself by preserving its semantic connections with neighborhood context. This new paradigm of ConPreDiff can generalize to arbitrary discrete and continuous diffusion backbones without introducing extra parameters in sampling procedure. Extensive experiments are conducted on unconditional image generation, text-to-image generation and image inpainting tasks. Our ConPreDiff consistently outperforms previous methods and achieves a new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.

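MOTIVATION AND OBJECTIVES

Address a limitation of point-wise reconstruction in diffusion models: it can overlook local neighborhood context.
Propose a context-prediction mechanism that reinforces each point to predict its neighborhood context during training.
Develop an efficient neighborhood-context decoding strategy based on distribution prediction and the Wasserstein distance.
Show that ConPreDiff generalizes to both discrete and continuous diffusion backbones without adding inference cost.
Demonstrate state-of-the-art performance on unconditional generation, text-to-image generation, and inpainting.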

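PROPOSED METHOD

Add a context-prediction head near the end of the denoising network that predicts the multi-stride neighborhood context of each point.
Represent neighborhood information as a distribution over multi-stride neighbors and decode it with a neural network.
Use a Wasserstein-distance-based loss to match the decoded neighborhood distribution to the true context, which enables efficient decoding of large contexts.
Recast neighbor prediction as distribution prediction to avoid a large increase in parameters.
Provide a theoretical connection showing that, under a specific aggregation, the ConPreDiff loss upper-bounds the standard DDPM objective.
Generalize ConPreDiff to both discrete and continuous diffusion backbones by adding a context loss term during training while leaving inference unchanged.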
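
The steps above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes scalar features on a 2-D grid, edge padding at the borders, a toy linear context decoder (`decoder_w`, `decoder_b`), and the closed-form 1-Wasserstein distance between equal-size 1-D samples. All function and parameter names here are hypothetical.

```python
import numpy as np


def true_context(x, stride):
    """Gather the 8 neighbors of every point at the given stride.

    Returns an array of shape (H, W, 8). Borders use edge padding,
    which is an assumption of this sketch.
    """
    h, w = x.shape
    padded = np.pad(x, stride, mode="edge")
    offsets = [(di, dj)
               for di in (-stride, 0, stride)
               for dj in (-stride, 0, stride)
               if (di, dj) != (0, 0)]  # skip the point itself
    return np.stack(
        [padded[stride + di: stride + di + h, stride + dj: stride + dj + w]
         for di, dj in offsets],
        axis=-1,
    )


def wasserstein_1d(a, b):
    """1-Wasserstein distance between equal-size 1-D samples (last axis).

    For equal-size empirical samples this is the mean absolute
    difference of the sorted values.
    """
    return np.mean(np.abs(np.sort(a, axis=-1) - np.sort(b, axis=-1)), axis=-1)


def training_loss(pred, target, decoder_w, decoder_b, strides=(1, 2), weight=0.1):
    """Point-wise MSE plus a weighted multi-stride context term.

    decoder_w and decoder_b have shape (len(strides), 8) and stand in for
    a learned context decoder; they are used only during training.
    """
    point_loss = np.mean((pred - target) ** 2)
    context = 0.0
    for k, stride in enumerate(strides):
        # Toy decoder: each point predicts 8 neighbor values per stride.
        predicted_ctx = pred[..., None] * decoder_w[k] + decoder_b[k]
        context += np.mean(wasserstein_1d(predicted_ctx, true_context(target, stride)))
    return float(point_loss + weight * context / len(strides))
```

Because the context term only enters `training_loss`, sampling would use the denoiser output `pred` alone, which mirrors the claim that the decoder is removed at inference. Comparing sorted samples treats the neighbors as an unordered set, so the decoder does not have to predict which neighbor sits at which offset.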

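EXPERIMENTAL RESULTS

Research questions

RQ1: Does explicit neighborhood-context prediction improve fidelity and diversity in diffusion-based image synthesis?
RQ2: Does predicting neighborhood context through distributions, rather than decoding every pixel/feature, scale efficiently to large contexts?
RQ3: Is ConPreDiff compatible with both discrete and continuous diffusion backbones, and does it help across a range of vision tasks?
RQ4: How do different neighborhood strides affect generation quality and training efficiency?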

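Key findings

ConPreDiff outperforms prior diffusion and non-diffusion models on text-to-image generation and image inpainting.
ConPreDiff, in both discrete and continuous form, sets a new state-of-the-art FID for text-to-image generation on MS-COCO; the abstract reports a zero-shot FID of 6.21.
Applied to existing diffusion backbones, context prediction consistently improves generation quality.
Distribution-based neighborhood decoding with a Wasserstein loss makes it possible to model large contexts while keeping computational cost down.
Context prediction improves unconditional image generation, text-to-image generation, and inpainting alike, an improvement attributed to better preservation of local context.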
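
For readers unfamiliar with the metric, the FID cited in this review is conventionally computed as the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The symbols below are standard notation and are not taken from the paper: $\mu_r, \Sigma_r$ are the feature mean and covariance for real images, $\mu_g, \Sigma_g$ those for generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the real one.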

This review was generated by AI and checked by a human editor.