QUICK REVIEW

[Paper Review] Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language

Seonghyeon Nam, Yunji Kim|arXiv (Cornell University)|Oct 29, 2018

Multimodal Machine Learning Applications | Citations: 66

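One-Line Summary

TAGAN uses a text-adaptive discriminator built from word-level local discriminators to manipulate image attributes according to natural language while preserving text-irrelevant content, and it outperforms the baselines on CUB and Oxford-102.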

ABSTRACT

This paper addresses the problem of manipulating images using natural language description. Our task aims to semantically modify visual attributes of an object in an image according to the text describing the new visual appearance. Although existing methods synthesize images having new attributes, they do not fully preserve text-irrelevant contents of the original image. In this paper, we propose the text-adaptive generative adversarial network (TAGAN) to generate semantically manipulated images while preserving text-irrelevant contents. The key to our method is the text-adaptive discriminator that creates word-level local discriminators according to input text to classify fine-grained attributes independently. With this discriminator, the generator learns to generate images where only regions that correspond to the given text are modified. Experimental results show that our method outperforms existing methods on CUB and Oxford-102 datasets, and our results were mostly preferred on a user study. Extensive analysis shows that our method is able to effectively disentangle visual attributes and produce pleasing outputs.

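Motivation and Objectives

Motivates semantic image manipulation guided by natural language descriptions.
Aims to change only the attributes described in the text while preserving the background and other content.
Introduces a text-adaptive discriminator composed of word-level discriminators, which enables fine-grained feedback to the generator.
Demonstrates qualitative and quantitative improvements over existing methods on CUB and Oxford-102.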

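Proposed Method

Encodes the input image into a feature representation and generates the manipulated output conditioned on the text.
Encodes the target text with a bidirectional RNN to produce word vectors.
Decomposes the sentence into word-level discriminators, each trained to detect its corresponding visual attribute.
Aggregates the word-level scores with a text attention mechanism to form the final conditional discriminator.
Uses a reconstruction loss that preserves text-irrelevant content when a positive (matching) text is given.
Uses multi-scale image features so that different attributes can be detected at different spatial scales.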
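
The word-level discrimination and attention steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each word gets a sigmoid classifier over a single pooled image feature vector, attention is a softmax of each word vector against the mean word vector, and the final score is taken as the attention-weighted geometric mean of the word-level scores. The function names, the single-scale simplification, and the toy inputs are all hypothetical; in a trained model the per-word weights would be produced from the word vectors by learned layers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(values):
    # Subtract the maximum for numerical stability.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def word_level_scores(image_feature, word_weights, word_biases):
    """One local discriminator per word: sigmoid(W_i . v + b_i).

    word_weights and word_biases stand in for parameters that a real
    model would generate from the word vectors (assumption).
    """
    return [
        sigmoid(dot(w, image_feature) + b)
        for w, b in zip(word_weights, word_biases)
    ]


def text_attention(word_vectors):
    """Softmax attention of each word against the mean word vector."""
    count = len(word_vectors)
    dim = len(word_vectors[0])
    mean_vec = [sum(w[d] for w in word_vectors) / count for d in range(dim)]
    return softmax([dot(mean_vec, w) for w in word_vectors])


def aggregate(scores, attention):
    """Attention-weighted geometric mean of the word-level scores."""
    log_score = sum(
        a * math.log(max(s, 1e-12)) for s, a in zip(scores, attention)
    )
    return math.exp(log_score)


# Toy usage with made-up numbers (two words, three image features).
image_feature = [0.5, -1.0, 2.0]
word_vectors = [[1.0, 0.0], [0.0, 1.0]]
word_weights = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]
word_biases = [0.0, 0.0]

scores = word_level_scores(image_feature, word_weights, word_biases)
attention = text_attention(word_vectors)
final_score = aggregate(scores, attention)
```

Because the aggregation is multiplicative, a single word whose attribute is missing from the image pulls the final score down in proportion to its attention weight, which is one way to give the generator word-specific feedback.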

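Experimental Results

Research Questions

RQ1: Can TAGAN manipulate an input image so that its visual attributes match the given text description while preserving irrelevant content?
RQ2: Does the text-adaptive discriminator enable finer-grained attribute control than a sentence-level discriminator?
RQ3: How does TAGAN compare with existing text-to-image synthesis baselines in terms of attribute accuracy and naturalness?
RQ4: How does the use of multi-scale features affect the handling of coarse and fine attributes?
RQ5: Is there evidence that TAGAN better disentangles visual attributes from text-irrelevant content?

Key Findings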

Method	CUB Accuracy	CUB Naturalness	CUB L2 error	Oxford-102 Accuracy	Oxford-102 Naturalness	Oxford-102 L2 error
SISGAN [15]	2.33	2.34	0.30	2.67	2.28	0.29
AttnGAN [13]	2.19	2.11	0.25	2.21	2.10	0.32
Ours	1.49	1.56	0.11	1.52	1.62	0.11

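TAGAN outperformed SISGAN and AttnGAN in the user study on both accuracy (agreement between attributes and text, plus content preservation) and naturalness; the user-study columns in the table read as average rankings, so lower is better.
Given a positive text, it achieved the lowest L2 reconstruction error of the compared methods (0.11 on both datasets), indicating better preservation of text-irrelevant content.
Qualitative results show accurate attribute changes while the background and undescribed content are preserved.
Ablation shows that multi-scale features improve the handling of attributes at different scales.
CAM visualizations and qualitative analysis show that the word-level attention and the layer-wise attribute detectors adapt to each word.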
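
The L2 reconstruction error in the table can be illustrated with a short sketch. It assumes images are flat lists of pixel values in [0, 1] and that the error is averaged over pixels; the review does not state the exact normalization, so `l2_error` is a hypothetical helper and its values are not directly comparable to the reported numbers.

```python
def l2_error(original, reconstructed):
    """Mean squared pixel difference between two flattened images.

    The per-pixel averaging is an assumption made for illustration.
    """
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    squared = [(a - b) ** 2 for a, b in zip(original, reconstructed)]
    return sum(squared) / len(squared)


# Toy usage: the model is fed an image together with a text that already
# describes it (a positive text), so the ideal output equals the input.
source_image = [0.0, 0.5, 1.0, 0.25]
model_output = [0.0, 0.5, 0.9, 0.25]
error = l2_error(source_image, model_output)
```

A lower value means the output stayed closer to the input when no attribute change was requested, which is why the metric serves as a proxy for preserving text-irrelevant content.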

This review was generated by AI and checked by a human editor.