QUICK REVIEW

[Paper Review] Learning Physical Graph Representations from Visual Scenes

Daniel M. Bear, Chaofei Fan|arXiv (Cornell University)|Jun 22, 2020

Human Pose and Action Recognition | References: 52 | Citations: 44

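One-Sentence Summary

PSGNet learns Physical Scene Graphs (PSGs), which represent a scene as a hierarchical, object-centric graph; aided by motion information and perceptual grouping principles, it outperforms CNN-based self-supervised methods at segmenting real-world scenes.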

ABSTRACT

Convolutional Neural Networks (CNNs) have proved exceptional at learning representations for visual object categorization. However, CNNs do not explicitly encode objects, parts, and their physical properties, which has limited CNNs' success on tasks that require structured understanding of visual scenes. To overcome these limitations, we introduce the idea of Physical Scene Graphs (PSGs), which represent scenes as hierarchical graphs, with nodes in the hierarchy corresponding intuitively to object parts at different scales, and edges to physical connections between parts. Bound to each node is a vector of latent attributes that intuitively represent object properties such as surface shape and texture. We also describe PSGNet, a network architecture that learns to extract PSGs by reconstructing scenes through a PSG-structured bottleneck. PSGNet augments standard CNNs by including: recurrent feedback connections to combine low and high-level image information; graph pooling and vectorization operations that convert spatially-uniform feature maps into object-centric graph structures; and perceptual grouping principles to encourage the identification of meaningful scene elements. We show that PSGNet outperforms alternative self-supervised scene representation algorithms at scene segmentation tasks, especially on complex real-world images, and generalizes well to unseen object types and scene arrangements. PSGNet is also able to learn from physical motion, enhancing scene estimates even for static images. We present a series of ablation studies illustrating the importance of each component of the PSGNet architecture, analyses showing that learned latent attributes capture intuitive scene properties, and illustrate the use of PSGs for compositional scene inference.

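Motivation and Objectives

Introduce Physical Scene Graphs (PSGs) as a hierarchical, object-centric scene representation with physically meaningful node attributes.
Develop PSGNet, a self-supervised architecture that reconstructs scenes through a PSG-structured bottleneck.
Incorporate perceptual grouping principles and graph-based operations to learn and render PSGs from visual data.
Show that PSGNet performs strongly at unsupervised scene segmentation on real-world images and benefits from motion information.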
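
As a rough illustration of the representation named above, a PSG can be pictured as a stack of node levels in which every node carries an attribute vector and points to its children one level down. The sketch below is an assumption made for exposition, not the authors' data structure: the class names, field names, and the `leaves_of` helper are all hypothetical, and only parent-child links are modeled.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class PSGNode:
    """One node of a scene graph (all field names are illustrative)."""
    level: int                 # position in the hierarchy, 0 = finest
    attributes: np.ndarray     # latent attribute vector bound to the node
    children: list = field(default_factory=list)  # node indices one level below


@dataclass
class PhysicalSceneGraph:
    """Hypothetical container: levels[l] holds the PSGNode objects of level l."""
    levels: list

    def leaves_of(self, level: int, index: int) -> list:
        """Return the level-0 node indices covered by the given node."""
        if level == 0:
            return [index]
        covered = []
        for child in self.levels[level][index].children:
            covered.extend(self.leaves_of(level - 1, child))
        return covered


# Toy graph: three fine-grained regions, two parts, one object.
regions = [PSGNode(0, np.zeros(4)) for _ in range(3)]
parts = [PSGNode(1, np.zeros(4), [0, 1]), PSGNode(1, np.zeros(4), [2])]
objects = [PSGNode(2, np.zeros(4), [0, 1])]
psg = PhysicalSceneGraph([regions, parts, objects])
covered = psg.leaves_of(2, 0)
```

Walking from the top-level node down to level 0 recovers the image regions that the object occupies, which is what makes a representation of this kind usable for segmentation.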

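Proposed Method

Define a hierarchical, graph-based scene representation (PSG) in which each node's attributes are tied to an image region.
Extract features with a ConvRNN backbone to produce the base tensor from which the PSG is built.
Apply learnable Graph Pooling and Graph Vectorization to build the PSG levels iteratively.
Render the PSG back into feature maps through graph rendering modules, including Quadratic Texture Rendering (QTR) and Quadratic Shape Rendering (QSR).
Incorporate static and motion-based perceptual grouping principles to guide the learning of affinities between nodes.
Train with self-supervised reconstruction losses on RGB, depth, and normal maps, together with QSR/QTR-based supervision that encourages meaningful object segmentation.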
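
The vectorization and rendering steps listed above can be made concrete with a small NumPy sketch. It rests on simplifying assumptions that are not taken from the paper: the segmentation `labels` is given rather than learned by graph pooling, node attributes are plain per-segment feature means, and the quadratic coefficients are obtained by least squares instead of being predicted by a network. The function names are hypothetical.

```python
import numpy as np


def vectorize_segments(features, labels, num_nodes):
    """Average the features of each segment into one attribute vector per node.

    features: (H, W, C) array; labels: (H, W) ints in [0, num_nodes).
    """
    channels = features.shape[-1]
    flat_feat = features.reshape(-1, channels)
    flat_lab = labels.reshape(-1)
    sums = np.zeros((num_nodes, channels))
    np.add.at(sums, flat_lab, flat_feat)          # unbuffered scatter-add per segment
    counts = np.bincount(flat_lab, minlength=num_nodes).astype(float)
    return sums / np.maximum(counts, 1.0)[:, None]


def quad_basis(ys, xs):
    """Quadratic monomials of pixel coordinates: 1, x, y, x^2, xy, y^2."""
    return np.stack([np.ones_like(xs), xs, ys, xs * xs, xs * ys, ys * ys], axis=-1)


def fit_and_render_quadratic(channel, labels, num_nodes):
    """Fit one quadratic per segment (least squares) and paint it back onto the image."""
    height, width = channel.shape
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    ys /= max(height - 1, 1)
    xs /= max(width - 1, 1)
    rendered = np.zeros((height, width))
    for node in range(num_nodes):
        mask = labels == node
        if not mask.any():
            continue
        basis = quad_basis(ys[mask], xs[mask])
        coef, *_ = np.linalg.lstsq(basis, channel[mask], rcond=None)
        rendered[mask] = basis @ coef
    return rendered


# Toy scene: two segments (left and right half) and a channel that is quadratic in x, y.
labels = np.zeros((6, 6), dtype=int)
labels[:, 3:] = 1
ys, xs = np.mgrid[0:6, 0:6] / 5.0
channel = xs ** 2 + 2 * ys
features = np.stack([channel, labels.astype(float)], axis=-1)
attrs = vectorize_segments(features, labels, 2)
recon = fit_and_render_quadratic(channel, labels, 2)
```

A single mean per node can only paint a flat color over each region, whereas six quadratic coefficients per node already capture smooth gradients inside a region; that is the intuition behind rendering node attributes as quadratic functions of position.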

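Experimental Results

Research Questions

RQ1: Can a hierarchical, graph-based representation learn meaningful object-centric scene elements without explicit supervision?
RQ2: Does motion information improve unsupervised scene segmentation and generalization to real-world images?
RQ3: How do perceptual grouping principles and graph-based pooling/vectorization affect the learning of scene structure?
RQ4: How well do learned PSGs transfer across datasets and object types?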

Key Findings

Model	Primitives Recall	Primitives mIoU	Primitives Boundary F	Playroom Recall	Playroom mIoU	Playroom Boundary F	Gibson Recall	Gibson mIoU	Gibson Boundary F	Gibson ARI
MONet	0.35	0.40	0.46	0.28	0.34	0.46	0.06	0.12	0.15	0.27
IODINE	0.63	0.54	0.57	0.09	0.15	0.17	0.11	0.15	0.14	0.30
Q++ (RGBDN)	0.55	0.54	0.62	0.50	0.53	0.65	0.20	0.20	0.24	0.45
OP3	-	-	-	0.24	0.28	0.31	-	-	-	-
PSGNetS	0.75	0.65	0.70	0.64	0.57	0.66	0.34	0.38	0.37	0.53
PSGNetM	-	-	-	0.70	0.62	0.70	-	-	-	-
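
For readers unfamiliar with the segmentation metrics in the table, the sketch below shows one common way to compute a per-object mIoU and a recall from two label maps. It is an assumption for illustration: the best-overlap matching rule and the 0.5 IoU threshold are not confirmed by this review as the paper's exact evaluation protocol, and the function names are hypothetical.

```python
import numpy as np


def best_match_ious(pred, gt):
    """For each ground-truth segment, the IoU of its best-overlapping predicted segment."""
    ious = []
    for g in np.unique(gt):
        g_mask = gt == g
        best = 0.0
        # Only predicted segments that touch this object can have non-zero IoU.
        for p in np.unique(pred[g_mask]):
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            best = max(best, inter / union)
        ious.append(best)
    return np.array(ious)


def miou_and_recall(pred, gt, thresh=0.5):
    """Mean best-match IoU, and the fraction of objects matched above `thresh` (assumed)."""
    ious = best_match_ious(pred, gt)
    return float(ious.mean()), float((ious > thresh).mean())


# Toy example: the left object is segmented perfectly, the right one is split in two.
gt = np.zeros((4, 4), dtype=int)
gt[:, 2:] = 1
pred = np.zeros((4, 4), dtype=int)
pred[:, 2] = 1
pred[:, 3] = 2
miou, recall = miou_and_recall(pred, gt)
```

Over-segmenting an object lowers both numbers, which is why methods that split real-world scenes into many fragments score poorly on these columns.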

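PSGNet outperforms the MONet, IODINE, and OP3 baselines at unsupervised scene segmentation across the Primitives, Playroom, and Gibson datasets (OP3 is reported for Playroom only); on Gibson, PSGNetS reaches 0.38 mIoU against 0.12 for MONet and 0.15 for IODINE.
Static training (PSGNetS) yields strong segmentation, exceeding the baselines on Primitives and producing reasonable decompositions on Gibson.
Motion-based training (PSGNetM) further improves Playroom segmentation (mIoU 0.62 versus 0.57 for PSGNetS), showing that learned motion cues improve performance on static images.
PSGNet transfers well: a model trained on one dataset transfers reasonably to another even when the two share few object models.
Ablations show that components such as local recurrence, feedback, and quadratic rendering contribute significantly to performance, and that depth/normal supervision helps but is not required.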

This review was generated by AI and checked by a human editor.