QUICK REVIEW

[Paper Review] Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation

Yikang Li, Wanli Ouyang|arXiv (Cornell University)|Jun 29, 2018

Multimodal Machine Learning Applications|Cited by 46

One-Sentence Summary

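Introduces Factorizable Net (F-Net), which factorizes the scene graph into subgraphs to reduce the number of intermediate representations and adds spatially aware modules, yielding faster and more accurate scene graph generation.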

ABSTRACT

Generating scene graph to describe all the relations inside an image gains increasing interests these years. However, most of the previous methods use complicated structures with slow inference speed or rely on the external data, which limits the usage of the model in real-life scenarios. To improve the efficiency of scene graph generation, we propose a subgraph-based connection graph to concisely represent the scene graph during the inference. A bottom-up clustering method is first used to factorize the entire scene graph into subgraphs, where each subgraph contains several objects and a subset of their relationships. By replacing the numerous relationship representations of the scene graph with fewer subgraph and object features, the computation in the intermediate stage is significantly reduced. In addition, spatial information is maintained by the subgraph features, which is leveraged by our proposed Spatial-weighted Message Passing (SMP) structure and Spatial-sensitive Relation Inference (SRI) module to facilitate the relationship recognition. On the recent Visual Relationship Detection and Visual Genome datasets, our method outperforms the state-of-the-art method in both accuracy and speed.

Research Motivation and Objectives

Make scene graph generation efficient by avoiding a separate representation for every object pair, whose count grows quadratically with the number of objects.
Propose a subgraph-based factorization to share phrase representations and reduce computation.
Preserve spatial information via 2-D subgraph feature maps and the proposed SMP and SRI modules.
Demonstrate improved speed and accuracy on VRD and Visual Genome datasets.
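
To make the quadratic growth concrete, the short sketch below counts candidate relationships for a few object counts. It assumes every ordered (subject, object) pair of distinct object proposals is a candidate; the function name `num_candidate_relations` is illustrative and does not come from the paper.

```python
def num_candidate_relations(num_objects):
    """Count ordered (subject, object) pairs of distinct proposals.

    Assumption: relations are directed and self-pairs are excluded,
    so N objects give N * (N - 1) candidates.
    """
    return num_objects * (num_objects - 1)


# Candidate counts for a few hypothetical proposal counts.
for n in (10, 50, 200):
    print(n, "objects ->", num_candidate_relations(n), "candidate relations")
```

Keeping one feature per candidate is what the subgraph factorization is meant to avoid: pairs that cover nearly the same image region can share a single subgraph feature.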

Proposed Method

Construct a fully-connected object relation graph and cluster similar relation regions into subgraphs.
Represent subgraphs with shared 2-D feature maps to retain spatial structure.
Apply Spatial-weighted Message Passing to refine object and subgraph features via attention-driven aggregation.
Use Spatial-sensitive Relation Inference to fuse subject, object, and subgraph features for predicate prediction, using a bottleneck convolution design.
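
The first two steps above can be sketched as a greedy grouping of object pairs by the overlap of their union boxes. This is a minimal illustration under stated assumptions, not the authors' implementation: the pair score (product of object confidences), the IoU threshold of 0.75, and all function names are hypothetical choices made for the example.

```python
from itertools import permutations


def union_box(a, b):
    """Smallest (x1, y1, x2, y2) box enclosing boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cluster_subgraphs(boxes, scores, iou_threshold=0.75):
    """Group ordered object pairs whose union boxes overlap strongly.

    Assumption: pairs are visited in descending score order and join the
    first existing subgraph whose representative region is similar enough;
    otherwise they start a new subgraph.
    """
    pairs = list(permutations(range(len(boxes)), 2))
    regions = [union_box(boxes[s], boxes[o]) for s, o in pairs]
    pair_scores = [scores[s] * scores[o] for s, o in pairs]
    order = sorted(range(len(pairs)), key=lambda k: pair_scores[k], reverse=True)

    subgraphs = []
    for k in order:
        for sg in subgraphs:
            if iou(regions[k], sg["region"]) >= iou_threshold:
                sg["pairs"].append(pairs[k])  # share this subgraph's feature
                break
        else:
            subgraphs.append({"region": regions[k], "pairs": [pairs[k]]})
    return subgraphs


# Toy input: two overlapping objects and one distant object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
subgraphs = cluster_subgraphs(boxes, scores)
total_pairs = sum(len(sg["pairs"]) for sg in subgraphs)
print(total_pairs, "candidate pairs ->", len(subgraphs), "subgraphs")
```

In this toy case six ordered pairs collapse into two shared regions, so only two subgraph feature maps would need to be computed while every pair remains available for predicate prediction.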

Experimental Results

Research Questions

RQ1: Can subgraph-based representations reduce the computational burden of scene graph generation without sacrificing accuracy?
RQ2: Does maintaining spatial information in subgraph feature maps improve predicate recognition?
RQ3: Can spatially-aware message passing and relation inference boost performance over state-of-the-art methods?
RQ4: How does the model perform on standard datasets (VRD and Visual Genome) in terms of speed and accuracy?

Key Findings

Dataset	Model	Phrase Detection Rec@50	Phrase Detection Rec@100	SGGen Rec@50	SGGen Rec@100	Speed (s/image)
VRD	LP	16.17	17.03	13.86	14.70	1.18 ∗
VRD	ViP-CNN	22.78	27.91	17.32	20.01	0.78
VRD	DR-Net	19.93	23.45	17.73	20.88	2.83
VRD	ILC	16.89	20.70	15.08	18.37	2.70 ∗∗
VRD	Ours-Full: 1-SMP	25.90	30.52	18.16	21.04	0.45
VRD	Ours-Full: 2-SMP	26.03	30.77	18.32	21.20	0.55
VG-MSDN	ISGG [58]	15.87	19.45	8.23	10.88	1.64
VG-MSDN	MSDN [35]	19.95	24.93	10.72	14.22	3.56
VG-MSDN	Ours-Full: 2-SMP	22.84	28.57	13.06	16.47	0.55
VG-DR-Net	DR-Net [6]	23.95	27.57	20.79	23.76	2.83
VG-DR-Net	Ours-Full: 2-SMP	26.91	32.63	19.88	23.95	0.55
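
The Rec@50 and Rec@100 columns are recall-at-K scores: the fraction of ground-truth relationships found among the K highest-scoring predictions for an image. The sketch below shows that computation in a simplified form. It assumes a prediction matches when its (subject, predicate, object) labels equal a ground-truth triplet, whereas benchmark evaluation also checks box localization; the function name and toy data are illustrative.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triplets recovered in the top-k predictions.

    predictions: list of (score, triplet) tuples.
    ground_truth: set of triplets, each a (subject, predicate, object) tuple.
    Simplification: matching is by labels only, with no box overlap check.
    """
    if not ground_truth:
        return 0.0
    top_k = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    hits = {triplet for _, triplet in top_k if triplet in ground_truth}
    return len(hits) / len(ground_truth)


# Toy example with four scored predictions and two ground-truth triplets.
preds = [
    (0.9, ("man", "riding", "horse")),
    (0.8, ("man", "wearing", "hat")),
    (0.4, ("horse", "on", "grass")),
    (0.3, ("man", "on", "horse")),
]
gt = {("man", "riding", "horse"), ("horse", "on", "grass")}

print(recall_at_k(preds, gt, 2))  # only one ground-truth triplet is in the top 2
print(recall_at_k(preds, gt, 3))  # both ground-truth triplets are in the top 3
```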

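Runs faster than every compared method and is more accurate on nearly every metric across the VRD and Visual Genome benchmarks; the one exception in the table is SGGen Rec@50 on VG-DR-Net, where DR-Net scores 20.79 against 19.88.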
Subgraph-based clustering dramatically reduces intermediate phrase representations, speeding inference.
2-D subgraph feature maps preserving spatial information improve predicate recognition.
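Spatial-weighted Message Passing (SMP) and Spatial-sensitive Relation Inference (SRI) contribute observable gains in SGGen recall and phrase detection.
Adding a second SMP module raises accuracy slightly at some cost in speed: on VRD, phrase detection Rec@50 goes from 25.90 to 26.03 while inference time goes from 0.45 to 0.55 s/image.
The full model with 2-SMP beats both VG-MSDN baselines on every reported metric, for example 22.84 vs. 19.95 (MSDN) on phrase detection Rec@50 and 13.06 vs. 10.72 on SGGen Rec@50.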
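
The speed claims can be put in relative terms with a little arithmetic on the timings in the table. The sketch below computes how many times faster the 2-SMP model is than each baseline. The LP and ILC rows are left out because their timings carry asterisk footnotes that are not reproduced here, and the dictionary keys are informal labels chosen for this example.

```python
# Seconds per image, copied from the table (2-SMP model: 0.55 s on every benchmark).
OURS_2SMP_SECONDS = 0.55

baseline_seconds = {
    "ViP-CNN (VRD)": 0.78,
    "DR-Net (VRD, VG-DR-Net)": 2.83,
    "ISGG (VG-MSDN)": 1.64,
    "MSDN (VG-MSDN)": 3.56,
}


def speedup(baseline_time, our_time):
    """How many times faster our_time is than baseline_time."""
    return baseline_time / our_time


speedups = {
    name: round(speedup(seconds, OURS_2SMP_SECONDS), 2)
    for name, seconds in baseline_seconds.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```

These ratios only restate the reported timings; they say nothing about whether the baselines were measured on the same hardware.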

This review was generated by AI and reviewed by human editors.