QUICK REVIEW

[Paper Review] Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation

Yikang Li, Wanli Ouyang|arXiv (Cornell University)|Jun 29, 2018

Multimodal Machine Learning Applications|Cited by 46

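One-line summary

Introduces Factorizable Net (F-Net), which factorizes the scene graph into subgraphs to reduce intermediate representations and uses spatially aware modules to generate scene graphs faster and more accurately.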

ABSTRACT

Generating scene graph to describe all the relations inside an image gains increasing interests these years. However, most of the previous methods use complicated structures with slow inference speed or rely on the external data, which limits the usage of the model in real-life scenarios. To improve the efficiency of scene graph generation, we propose a subgraph-based connection graph to concisely represent the scene graph during the inference. A bottom-up clustering method is first used to factorize the entire scene graph into subgraphs, where each subgraph contains several objects and a subset of their relationships. By replacing the numerous relationship representations of the scene graph with fewer subgraph and object features, the computation in the intermediate stage is significantly reduced. In addition, spatial information is maintained by the subgraph features, which is leveraged by our proposed Spatial-weighted Message Passing (SMP) structure and Spatial-sensitive Relation Inference (SRI) module to facilitate the relationship recognition. On the recent Visual Relationship Detection and Visual Genome datasets, our method outperforms the state-of-the-art method in both accuracy and speed.

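Motivation and objectives

Enable efficient scene graph generation that avoids one representation per object pair, whose number grows quadratically with the number of objects.
Propose a subgraph-based factorization that shares phrase representations and reduces computation.
Preserve spatial information through 2-D subgraph feature maps and the purpose-built SMP and SRI modules.
Demonstrate improvements in speed and accuracy on the VRD and Visual Genome datasets.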

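Proposed method

Build a fully connected object relationship graph, then cluster similar relationship regions into subgraphs.
Represent each subgraph with a shared 2-D feature map so that spatial structure is preserved.
Apply Spatial-weighted Message Passing (SMP) to refine object and subgraph features through attention-based aggregation.
Use Spatial-sensitive Relation Inference (SRI) to combine subject, object, and subgraph features and predict the predicate with a bottleneck convolution approach.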
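
The first step of the method (pairing every object with every other object, then clustering similar relationship regions) can be illustrated with a small sketch. This is not the authors' implementation: the greedy first-fit grouping, the use of IoU between union boxes as the similarity measure, the 0.5 threshold, and the names `union_box`, `iou`, and `factorize` are all assumptions made for illustration.

```python
from itertools import permutations


def union_box(a, b):
    """Smallest (x1, y1, x2, y2) box that covers both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def factorize(object_boxes, iou_threshold=0.5):
    """Group all ordered object pairs into subgraphs by union-box overlap.

    Assumption: greedy first-fit grouping stands in for the paper's
    bottom-up clustering. Returns [(subgraph_box, [(subj, obj), ...]), ...].
    """
    subgraphs = []
    # Fully connected graph: every ordered (subject, object) pair.
    for s, o in permutations(range(len(object_boxes)), 2):
        region = union_box(object_boxes[s], object_boxes[o])
        for box, pairs in subgraphs:
            if iou(region, box) >= iou_threshold:
                pairs.append((s, o))  # reuse an existing subgraph region
                break
        else:
            subgraphs.append((region, [(s, o)]))  # open a new subgraph
    return subgraphs


# Hypothetical toy boxes: two overlapping objects and one far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
subgraphs = factorize(boxes)
num_pairs = sum(len(pairs) for _, pairs in subgraphs)
print(num_pairs, "relationship candidates share", len(subgraphs), "subgraph regions")
```

With these toy boxes, the six ordered pairs collapse onto two shared regions, which is the kind of reduction in intermediate representations the method relies on.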

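Experimental results

Research questions

RQ1: Can a subgraph-based representation reduce the computational cost of scene graph generation without sacrificing accuracy?
RQ2: Does preserving spatial information in the subgraph feature maps improve predicate recognition?
RQ3: Do spatially aware message passing and relation inference outperform state-of-the-art methods?
RQ4: How does the model perform in terms of speed and accuracy on the standard datasets (VRD and Visual Genome)?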

Key findings

Dataset	Model	PhrDet Rec@50	PhrDet Rec@100	SGGen Rec@50	SGGen Rec@100	Speed (s/img)
VRD	LP	16.17	17.03	13.86	14.70	1.18 ∗
VRD	ViP-CNN	22.78	27.91	17.32	20.01	0.78
VRD	DR-Net	19.93	23.45	17.73	20.88	2.83
VRD	ILC	16.89	20.70	15.08	18.37	2.70 ∗∗
VRD	Ours Full:1-SMP	25.90	30.52	18.16	21.04	0.45
VRD	Ours Full:2-SMP	26.03	30.77	18.32	21.20	0.55
VG-MSDN	ISGG [58]	15.87	19.45	8.23	10.88	1.64
VG-MSDN	MSDN [35]	19.95	24.93	10.72	14.22	3.56
VG-MSDN	Ours-Full: 2-SMP	22.84	28.57	13.06	16.47	0.55
VG-DR-Net	DR-Net [6]	23.95	27.57	20.79	23.76	2.83
VG-DR-Net	Ours-Full: 2-SMP	26.91	32.63	19.88	23.95	0.55
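
To make the Speed column easier to read, the short calculation below converts it into speedup ratios relative to the 2-SMP model. The numbers are copied from the table; the dictionary layout and the `speedup` helper are illustrative only. The LP and ILC timings carry footnote markers (∗, ∗∗) whose text is not reproduced in this review, so their ratios should be read with caution.

```python
# Seconds per image, copied from the Speed column of the table.
OURS_2SMP = 0.55
baselines = {
    "LP (VRD)": 1.18,        # marked with a footnote in the table
    "ViP-CNN (VRD)": 0.78,
    "DR-Net (VRD)": 2.83,
    "ILC (VRD)": 2.70,       # marked with a footnote in the table
    "ISGG (VG-MSDN)": 1.64,
    "MSDN (VG-MSDN)": 3.56,
}


def speedup(baseline_s, ours_s):
    """How many times faster `ours_s` is than `baseline_s`."""
    return baseline_s / ours_s


ratios = {name: speedup(t, OURS_2SMP) for name, t in baselines.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

On these figures the 2-SMP model is roughly 1.4x faster than ViP-CNN and about 5.1x faster than DR-Net per image.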

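Outperforms prior methods in both accuracy and speed on the VRD and Visual Genome benchmarks on nearly every reported metric; the one exception in the table is SGGen Rec@50 on VG-DR-Net, where DR-Net scores 20.79 against 19.88.
Subgraph-based clustering sharply reduces the number of intermediate phrase representations and speeds up inference.
2-D subgraph feature maps, which preserve spatial information, improve predicate recognition.
Spatial-weighted Message Passing and Spatial-sensitive Relation Inference bring observable gains in SGGen recall and phrase detection.
Adding a second SMP module slightly raises accuracy at some cost in speed (on VRD, PhrDet Rec@50 goes from 25.90 to 26.03 while inference slows from 0.45 to 0.55 s/img), indicating a trade-off.
The full model with 2-SMP is the strongest configuration reported, for example 26.03 PhrDet Rec@50 on VRD against 22.78 for ViP-CNN.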
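
The idea behind spatially weighted aggregation, where an object attends over the locations of a 2-D subgraph feature map instead of average-pooling it, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's SMP module: plain dot-product attention replaces whatever learned layers the authors use, and `spatial_weighted_pool`, `softmax`, and `dot` are hypothetical helper names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def spatial_weighted_pool(object_feat, subgraph_map):
    """Pool an H x W grid of D-dim features into one D-dim message.

    Assumption: attention weights come from a dot product between the
    object feature and each spatial cell of the subgraph feature map.
    """
    cells = [cell for row in subgraph_map for cell in row]
    weights = softmax([dot(object_feat, cell) for cell in cells])
    dim = len(object_feat)
    message = [sum(w * cell[d] for w, cell in zip(weights, cells))
               for d in range(dim)]
    return message, weights


# Hypothetical 2 x 2 feature map with 2-dim features per location.
obj = [1.0, 0.0]
fmap = [[[5.0, 0.0], [0.0, 1.0]],
        [[0.0, 2.0], [0.0, 3.0]]]
message, weights = spatial_weighted_pool(obj, fmap)
print("attention weights:", [round(w, 3) for w in weights])
print("message to object:", [round(x, 3) for x in message])
```

In this toy case almost all of the attention mass lands on the top-left cell, the only location whose feature aligns with the object, so the pooled message is dominated by that location rather than by a uniform average.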


This review was generated by AI and checked by a human editor.