QUICK REVIEW

[Paper Review] PasteGAN: A Semi-Parametric Method to Generate Image from Scene Graph

Yikang Li, Tao Ma|arXiv (Cornell University)|May 5, 2019

Multimodal Machine Learning Applications | References: 33 | Citations: 42

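One-Line Summary

PasteGAN generates images from scene graphs, using external object crops as anchors for object appearance. A Crop Refining Network and an Object-Image Fuser embed the objects and their relationships into a single map, and a Crop Selector retrieves compatible crops when none are supplied. On Visual Genome and COCO-Stuff it reports higher Inception Score (IS) and Diversity Score, and lower Fréchet Inception Distance (FID), than prior SOTA methods.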

ABSTRACT

Despite some exciting progress on high-quality image generation from structured(scene graphs) or free-form(sentences) descriptions, most of them only guarantee the image-level semantical consistency, i.e. the generated image matching the semantic meaning of the description. They still lack the investigations on synthesizing the images in a more controllable way, like finely manipulating the visual appearance of every object. Therefore, to generate the images with preferred objects and rich interactions, we propose a semi-parametric method, PasteGAN, for generating the image from the scene graph and the image crops, where spatial arrangements of the objects and their pair-wise relationships are defined by the scene graph and the object appearances are determined by the given object crops. To enhance the interactions of the objects in the output, we design a Crop Refining Network and an Object-Image Fuser to embed the objects as well as their relationships into one map. Multiple losses work collaboratively to guarantee the generated images highly respecting the crops and complying with the scene graphs while maintaining excellent image quality. A crop selector is also proposed to pick the most-compatible crops from our external object tank by encoding the interactions around the objects in the scene graph if the crops are not provided. Evaluated on Visual Genome and COCO-Stuff dataset, our proposed method significantly outperforms the SOTA methods on Inception Score, Diversity Score and Fréchet Inception Distance. Extensive experiments also demonstrate our method's ability to generate complex and diverse images with given objects.

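Motivation and Objectives

Existing scene-graph-to-image methods offer no fine-grained control over the appearance of individual objects; this is the motivation for the work.
Propose a semi-parametric framework that uses external object crops to guide rendering while respecting the scene-graph structure.
Provide automatic crop selection to handle the case where the user does not specify object appearances.
Fuse object appearances and relationships into a unified latent canvas to achieve high-quality image synthesis.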

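Proposed Method

The scene graph is encoded with a graph convolutional network to obtain a context vector for each object.
A Crop Refining Network encodes the object crops and fuses them with relationship-aware features (via the Object² Refiner).
An attentional Object-Image Fuser injects the object features into a latent scene canvas, guided by the relationships in the scene graph.
A Crop Selector retrieves the most compatible object crops from an external object tank based on the scene-graph context.
Training combines adversarial losses from two discriminators (image-level and object-level) with reconstruction, perceptual, and box-regression losses to keep crops, objects, and scene layout consistent.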
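The retrieval step performed by the Crop Selector can be illustrated with a minimal sketch. It assumes that every crop in the object tank has a precomputed embedding and a category label, and that compatibility is measured by L2 distance between the object's context vector and the crop embeddings within the same category. The function name `select_crop`, the array layout, and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np


def select_crop(context_vec, category, tank_embeddings, tank_categories):
    """Return the index of the same-category tank crop closest to context_vec.

    Assumption: "most compatible" is approximated here by the smallest L2
    distance in a shared embedding space; the paper's exact scoring may differ.
    """
    # Restrict the search to crops that depict the requested object category.
    candidates = np.flatnonzero(tank_categories == category)
    if candidates.size == 0:
        raise ValueError(f"no crops of category {category!r} in the tank")
    # L2 distance from the query context vector to each candidate embedding.
    dists = np.linalg.norm(tank_embeddings[candidates] - context_vec, axis=1)
    return int(candidates[np.argmin(dists)])


# Hypothetical toy tank: four crops with 2-D embeddings.
tank_categories = np.array(["sheep", "sheep", "tree", "sheep"])
tank_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],  # exact match to the query, but the wrong category
    [0.7, 0.7],
])

query = np.array([0.6, 0.8])
best = select_crop(query, "sheep", tank_embeddings, tank_categories)
```

In this toy example the "tree" crop matches the query exactly but is skipped by the category filter, so the closest "sheep" crop is returned instead.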

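Experimental Results

Research Questions

RQ1: Can a semi-parametric generation framework that uses external object crops produce images that faithfully respect the scene graph while allowing fine-grained control over object appearance?
RQ2: Does integrating the Crop Refining Network and the Object-Image Fuser improve object-level appearance consistency and scene-level arrangement compared with earlier scene-graph-to-image methods?
RQ3: Does a Crop Selector that retrieves contextually compatible crops improve image quality and diversity even without manually specified appearances?
RQ4: How do the proposed components affect standard image-synthesis metrics (IS, Diversity, FID) on Visual Genome and COCO-Stuff?
RQ5: Can the method generate complex and diverse scenes from given objects while maintaining high visual fidelity?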
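For reference, the two distribution-level metrics named in RQ4 are conventionally defined as follows. This is the standard formulation, not a restatement of the paper's evaluation code. Here $p_g$ is the generator distribution, $p(y \mid x)$ is the Inception classifier's label distribution for image $x$, and $(\mu_r, \Sigma_r)$, $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features for real and generated images. Higher IS is better and lower FID is better. The Diversity Score is not defined in this review; it is assumed here to be a perceptual distance between pairs of images generated from the same input.

```latex
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g}
  \big[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big),
\qquad
p(y) = \mathbb{E}_{x \sim p_g}\big[ p(y \mid x) \big]

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\Big( \Sigma_r + \Sigma_g
  - 2 \big( \Sigma_r \Sigma_g \big)^{1/2} \Big)
```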

Key Findings

Method	IS (COCO)	IS (VG)	Diversity (COCO)	Diversity (VG)	FID (COCO)	FID (VG)
Real Images	16.3±0.4	13.9±0.5	-	-	-	-
sg2im	6.7±0.1	5.5±0.1	0.02±0.01	0.12±0.06	82.75	71.27
PasteGAN	9.1±0.2	6.9±0.2	0.27±0.11	0.24±0.09	50.94	58.53
sg2im (GT)	7.3±0.1	6.3±0.2	0.02±0.01	0.15±0.12	63.28	52.96
PasteGAN (GT)	10.2±0.2	8.2±0.2	0.32±0.09	0.29±0.08	38.29	35.25

PasteGAN achieves a higher Inception Score than sg2im on both COCO-Stuff and Visual Genome (COCO: 9.1±0.2 vs 6.7±0.1; VG: 6.9±0.2 vs 5.5±0.1).
PasteGAN achieves a lower Fréchet Inception Distance than sg2im (COCO: 50.94 vs 82.75; VG: 58.53 vs 71.27).
Using ground-truth (GT) inputs instead of predicted ones further raises IS and lowers FID (COCO: IS 10.2±0.2, FID 38.29; VG: IS 8.2±0.2, FID 35.25).
Compared with layout2im, PasteGAN is competitive on IS and Diversity and has a favorable FID; crop-guided generation improves object-level fidelity.
The ablation study shows that removing any one of the Crop Selector, the Object² Refiner, or the Object-Image Fuser lowers IS and worsens FID, which underlines the contribution of each component.
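To put the reported gaps in relative terms, the short script below recomputes the percentage change of PasteGAN over sg2im from the table values (predicted setting, mean scores only). The dictionary layout and the helper `relative_change` are illustrative choices, not part of the paper.

```python
# Mean scores copied from the results table (predicted setting, not GT).
results = {
    "COCO": {
        "sg2im": {"IS": 6.7, "FID": 82.75},
        "PasteGAN": {"IS": 9.1, "FID": 50.94},
    },
    "VG": {
        "sg2im": {"IS": 5.5, "FID": 71.27},
        "PasteGAN": {"IS": 6.9, "FID": 58.53},
    },
}


def relative_change(new, old):
    """Percentage change of `new` relative to `old` (negative means lower)."""
    return (new - old) / old * 100.0


# changes[dataset][metric] = percent change of PasteGAN relative to sg2im.
changes = {
    dataset: {
        metric: relative_change(scores["PasteGAN"][metric], scores["sg2im"][metric])
        for metric in ("IS", "FID")
    }
    for dataset, scores in results.items()
}

for dataset, metrics in changes.items():
    print(f"{dataset}: IS {metrics['IS']:+.1f}%, FID {metrics['FID']:+.1f}%")
```

On these numbers the IS gain is roughly 36% on COCO-Stuff and 25% on Visual Genome, while FID drops by roughly 38% and 18% respectively. The standard deviations reported for IS are ignored in this calculation.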


This review was generated by AI and checked by a human editor.