QUICK REVIEW

[Paper Review] Context-Aware Synthesis and Placement of Object Instances

Dong‐Hoon Lee, Sifei Liu|arXiv (Cornell University)|Dec 6, 2018

Multimodal Machine Learning Applications|Cited by 65

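One-Line Summary

The paper presents an end-to-end conditional GAN framework with two interconnected modules, where and what, that synthesizes object instance masks and places them into a semantic label map conditioned on the scene context, modeling both the location distribution and the shape distribution.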

ABSTRACT

Learning to insert an object instance into an image in a semantically coherent manner is a challenging and interesting problem. Solving it requires (a) determining a location to place an object in the scene and (b) determining its appearance at the location. Such an object insertion model can potentially facilitate numerous image editing and scene parsing applications. In this paper, we propose an end-to-end trainable neural network for the task of inserting an object instance mask of a specified class into the semantic label map of an image. Our network consists of two generative modules where one determines where the inserted object mask should be (i.e., location and scale) and the other determines what the object mask shape (and pose) should look like. The two modules are connected together via a spatial transformation network and jointly trained. We devise a learning procedure that leverages both supervised and unsupervised data and show our model can insert an object at diverse locations with various appearances. We conduct extensive experimental validations with comparisons to strong baselines to verify the effectiveness of the proposed network.

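Motivation and Objectives

Motivates and addresses the problem of inserting a new object instance into an image in a way that respects the scene's semantics.
Learns the joint distribution of where to place an object and what shape/pose it should take, conditioned on the input semantic map.
Enables diverse, plausible object insertions suited to image editing, AR/VR, and data augmentation.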
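
One way to read the joint distribution mentioned above is as a two-step factorization that matches the where/what split. The notation is an assumption made here for illustration and is not taken from the paper: x is the input semantic label map, A is the affine transform encoding location and scale, and s is the object mask shape.

```latex
p(A, s \mid x) \;=\; \underbrace{p(A \mid x)}_{\text{where}} \; \underbrace{p(s \mid A, x)}_{\text{what}}
```

Under this reading, sampling an insertion means first drawing a placement A given the scene, then drawing a shape s given both the scene and that placement.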

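Proposed Method

Two generative modules: the where module predicts location/scale as an affine transformation via a Spatial Transformer Network (STN); the what module generates the object mask conditioned on that location.
Each module is a conditional GAN with a shared encoder and incorporates a unit-Gaussian variational latent variable to model variation.
Training uses a three-term loss for the where module (adversarial layout loss, input reconstruction loss, and supervised affine-transform loss) to mitigate mode collapse.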
The what module mirrors this with discriminators for layout and shape, and a supervised path to promote diverse, realistic shapes.
An end-to-end differentiable link between modules via the STN enables joint optimization and consistent placement of generated shapes.
During training, a supervised path and an unsupervised path are used to alleviate mode collapse; inference uses only the unsupervised path.
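
The affine placement step can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names `placement_to_theta` and `affine_place` are hypothetical, the coordinate convention (normalized to [-1, 1], with the matrix mapping output coordinates back to mask coordinates) follows the usual STN sampling-grid setup, and nearest-neighbor sampling is used for brevity where a real STN would use differentiable bilinear sampling.

```python
import math


def placement_to_theta(scale, tx, ty):
    """Build a 2x3 affine matrix mapping canvas coords to mask coords.

    Coordinates are normalized to [-1, 1]. The mask is shrunk by `scale`
    and centered at (tx, ty) on the canvas, so the inverse map is
    ((x - tx) / scale, (y - ty) / scale).
    """
    inv = 1.0 / scale
    return [[inv, 0.0, -tx * inv],
            [0.0, inv, -ty * inv]]


def affine_place(mask, theta, out_h, out_w):
    """Warp a small mask onto an out_h x out_w canvas (nearest neighbor)."""
    in_h, in_w = len(mask), len(mask[0])
    canvas = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        y = (2 * i + 1) / out_h - 1.0  # normalized center of canvas row i
        for j in range(out_w):
            x = (2 * j + 1) / out_w - 1.0  # normalized center of column j
            # Map the canvas location back into mask coordinates.
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            # Convert normalized mask coords to the nearest pixel index.
            u = math.floor(((xs + 1.0) * in_w - 1.0) / 2.0 + 0.5)
            v = math.floor(((ys + 1.0) * in_h - 1.0) / 2.0 + 0.5)
            if 0 <= u < in_w and 0 <= v < in_h:
                canvas[i][j] = mask[v][u]
    return canvas


# Toy example: place a 2x2 mask at half scale in the lower-right quadrant.
theta = placement_to_theta(scale=0.5, tx=0.5, ty=0.5)
for row in affine_place([[1, 0], [1, 1]], theta, 4, 4):
    print(row)
```

In the described method the transform parameters would come from the where module (scene plus a sampled latent), and the mask from the what module; here both are hard-coded toy values.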

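Experimental Results

Research Questions

RQ1: How can object instances be inserted plausibly into a semantic label map while respecting scene context and geometry?
RQ2: Can a model learn the joint distribution of where to place an object and what shape to generate, conditioned on the input scene?
RQ3: Does decomposing the problem into separate where and what modules improve training stability and output diversity?
RQ4: How consistent are the insertions with real-world context, as measured by downstream recognition/detection?
RQ5: How do the main discriminators and supervision affect the preservation of diversity and realism?

Key Findings

The proposed architecture learns plausible, context-aware distributions over placement (where) and shape (what).
The two-module end-to-end design, with a differentiable link via the STN, enables joint optimization of placement and appearance.
In the ablation study, removing discriminators or supervision leads to mode collapse or reduced diversity and location accuracy.
In the human evaluation, synthesized insertions were judged to be real in 43% of cases, indicating strong realism.
On the Cityscapes test set, the full model achieves a recall of 0.79, higher than the ablated variants, showing the benefit of all components.
Inserted instances are most likely to be detected by a state-of-the-art detector when all discriminators are used (the table reports a recall of 0.79 for the full model).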
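
As a rough illustration of the recall figure, the sketch below assumes that recall here means the fraction of inserted instances that a pretrained detector recovers. The function name `detection_recall` and the counts are hypothetical and chosen only to reproduce the reported value; they are not data from the paper.

```python
def detection_recall(detected_flags):
    """Fraction of inserted instances that the detector recovered."""
    if not detected_flags:
        raise ValueError("need at least one inserted instance")
    return sum(1 for d in detected_flags if d) / len(detected_flags)


# Hypothetical outcome: 79 of 100 inserted instances are detected.
flags = [True] * 79 + [False] * 21
print(detection_recall(flags))  # 0.79
```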


This review was generated by AI and checked by a human editor.