QUICK REVIEW

[Paper Review] Visual Relationship Detection with Language Priors

Cewu Lu, Ranjay Krishna | arXiv (Cornell University) | Jul 31, 2016

Multimodal Machine Learning Applications | References: 35 | Cited by: 112

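ONE-SENTENCE SUMMARY

This work proposes a scalable model that learns the visual appearance of objects and predicates and uses language priors to predict and localize thousands of visual relationships, enabling zero-shot detection and improved image retrieval.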

ABSTRACT

Visual relationships capture a wide variety of interactions between pairs of objects in images (e.g. "man riding bicycle" and "man pushing bicycle"). Consequently, the set of possible relationships is extremely large and it is difficult to obtain sufficient training examples for all possible relationships. Because of this limitation, previous work on visual relationship detection has concentrated on predicting only a handful of relationships. Though most relationships are infrequent, their objects (e.g. "man" and "bicycle") and predicates (e.g. "riding" and "pushing") independently occur more frequently. We propose a model that uses this insight to train visual models for objects and predicates individually and later combines them together to predict multiple relationships per image. We improve on prior work by leveraging language priors from semantic word embeddings to finetune the likelihood of a predicted relationship. Our model can scale to predict thousands of types of relationships from a few examples. Additionally, we localize the objects in the predicted relationships as bounding boxes in the image. We further demonstrate that understanding relationships can improve content based image retrieval.

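MOTIVATION AND GOALS

Motivate detecting and localizing a diverse range of visual relationships without bias toward the few most common ones.
Propose a model with two modules that learns the visual appearance of objects and predicates and fuses them to predict relationships.
Introduce a language embedding module that uses word vectors to relate similar relationships.
Demonstrate zero-shot visual relationship detection and show improvements in content-based image retrieval.
Provide a new dataset with thousands of relationship types for benchmarking visual relationship prediction.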

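PROPOSED METHOD

Train separate object detectors and predicate detectors using a CNN (VGG) and R-CNN proposals.
Model a visual relationship R = <i, k, j> (object class i, predicate k, object class j) as V(R) = P_i(O1) * (z_k^T CNN(O1,O2) + s_k) * P_j(O2), where P_i and P_j are the object detector likelihoods for the two boxes O1 and O2.
Project the relationship into a language embedding space with f(R) = w_k^T [word2vec(t_i), word2vec(t_j)] + b_k, where t_i and t_j are the words for the two object classes.
Encourage semantically similar relationships to project close together by minimizing K(W), the variance of distance-normalized differences between projected relationships.
Impose a ranking loss L(W) so that relationships occurring more frequently in training score higher than less frequent ones.
Combine the visual-language ranking loss C with L and K in a joint training objective: C + λ1 L + λ2 K.
At test time, score each object pair with R* = argmax_R V(R,Θ|O1,O2) f(R,W), where Θ and W are the parameters of the visual and language modules.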
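
The scoring steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`visual_score`, `language_score`, `best_relationship`), the data layout, and the toy numbers are all hypothetical, with small hand-written vectors standing in for real detector outputs, CNN features, and word2vec embeddings.

```python
# Hypothetical helpers and toy values; nothing here comes from the paper's code.

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def visual_score(p_i, p_j, z_k, s_k, cnn_feat):
    """V(R) = P_i(O1) * (z_k^T CNN(O1,O2) + s_k) * P_j(O2)."""
    return p_i * (dot(z_k, cnn_feat) + s_k) * p_j

def language_score(w_k, b_k, vec_i, vec_j):
    """f(R) = w_k^T [vec(t_i), vec(t_j)] + b_k (list '+' concatenates)."""
    return dot(w_k, vec_i + vec_j) + b_k

def best_relationship(p_obj1, p_obj2, Z, S, cnn_feat, W, B, word_vecs):
    """Return the (i, k, j) triple maximizing V(R) * f(R) for one object pair."""
    best, best_score = None, float("-inf")
    for i, p_i in enumerate(p_obj1):
        for j, p_j in enumerate(p_obj2):
            for k in range(len(Z)):
                score = (visual_score(p_i, p_j, Z[k], S[k], cnn_feat)
                         * language_score(W[k], B[k], word_vecs[i], word_vecs[j]))
                if score > best_score:
                    best, best_score = (i, k, j), score
    return best, best_score

# Toy setup: 2 object classes, 2 predicates, 1-D stand-ins for word vectors.
p_obj1 = [0.9, 0.1]               # class likelihoods for box O1
p_obj2 = [0.2, 0.8]               # class likelihoods for box O2
Z = [[1.0, 0.0], [0.0, 1.0]]      # one z_k per predicate
S = [0.0, 0.0]                    # one s_k per predicate
cnn_feat = [0.5, 2.0]             # stand-in for CNN(O1, O2)
W = [[1.0, 1.0], [1.0, 1.0]]      # one w_k per predicate
B = [0.0, 0.0]                    # one b_k per predicate
word_vecs = [[1.0], [2.0]]        # stand-in for word2vec(t_i)

best, best_score = best_relationship(p_obj1, p_obj2, Z, S, cnn_feat, W, B, word_vecs)
print(best, best_score)
```

The exhaustive triple loop is only practical for a toy vocabulary; with many object and predicate classes a vectorized version would be the more natural choice.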

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can visual relationships be detected by combining independently learned object and predicate appearances with language priors?
RQ2: How do embedding-based language priors affect recognition, especially for infrequent or unseen relationships?
RQ3: Does the proposed model scale to thousands of relationship types and support zero-shot learning?
RQ4: Does leveraging relationships improve image retrieval performance?
RQ5: How does the model perform on a new large-scale visual relationship dataset compared to prior methods?

KEY FINDINGS

Phrase Det.         Relationship Det.   Predicate Det.
R@100    R@50       R@100    R@50       R@100    R@50
0.07     0.04       -        -          1.91     0.97
0.09     0.07       0.09     0.07       2.03     1.47
2.61     2.24       1.85     1.58       7.11     7.11
0.08     0.08       0.08     0.08       18.22    18.22
6.39     6.65       5.47     5.27       28.87    28.87
8.59     9.13       9.18     9.04       35.20    35.20
8.91     9.60       9.63     9.71       36.31    36.31
17.03    16.17      14.70    13.86      47.87    47.87
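
For readers unfamiliar with the R@K columns, the following sketch shows one way a per-image relationship Recall@K could be computed. It is an illustration only: the function names, the `(x1, y1, x2, y2)` box format, and the 0.5 IoU matching threshold are assumptions rather than details stated in this review, and the example predictions are made up.

```python
# Hypothetical evaluation sketch; threshold and data layout are assumptions.

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def relationship_recall_at_k(predictions, ground_truth, k, iou_thresh=0.5):
    """Fraction of ground-truth relationships matched by the top-k predictions.

    predictions:  list of (score, (subject, predicate, object), subj_box, obj_box)
    ground_truth: list of ((subject, predicate, object), subj_box, obj_box)
    """
    if not ground_truth:
        return 0.0
    top = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    hits = 0
    for gt_triple, gt_sbox, gt_obox in ground_truth:
        for _, triple, sbox, obox in top:
            if (triple == gt_triple
                    and iou(sbox, gt_sbox) >= iou_thresh
                    and iou(obox, gt_obox) >= iou_thresh):
                hits += 1
                break  # count each ground-truth relationship at most once
    return hits / len(ground_truth)

# Made-up example: two annotated relationships, three scored predictions.
man, bike, hat = (0, 0, 10, 10), (5, 5, 15, 15), (2, 0, 6, 3)
ground_truth = [(("man", "riding", "bicycle"), man, bike),
                (("man", "wearing", "hat"), man, hat)]
predictions = [(0.9, ("man", "riding", "bicycle"), man, bike),
               (0.8, ("man", "pushing", "bicycle"), man, bike),
               (0.1, ("man", "wearing", "hat"), man, hat)]

recall_1 = relationship_recall_at_k(predictions, ground_truth, k=1)
recall_3 = relationship_recall_at_k(predictions, ground_truth, k=3)
print(recall_1, recall_3)
```

Under this reading, predicate detection would skip the box check entirely because the boxes are given, which is one plausible reason its recall values in the table are much higher.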

The full model (V + L + K) substantially outperforms prior methods on the new dataset for phrase, relationship, and predicate detection.
Zero-shot visual relationship detection improves when language priors and the similarity embedding (the K term) are used.
Language priors let the model scale to thousands of relationships from limited examples and make zero-shot detection possible at evaluation time.
On the Visual Phrases dataset, the full model achieves higher mAP and strong recall, indicating benefits from embedding-based priors.
Using predicted relationships improves content-based image retrieval, achieving higher Recall@1 and lower median rank than baselines.
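
The two retrieval metrics mentioned in the last finding can be illustrated with a short sketch. The function names and the list of ranks are hypothetical, and the numbers are not results from the paper: each rank is assumed to be the 1-based position at which the correct image appears for one query.

```python
import statistics

def recall_at_k(ranks, k):
    """Fraction of queries whose correct image appears within the top k results."""
    return sum(r <= k for r in ranks) / len(ranks)

def median_rank(ranks):
    """Median position of the correct image across queries (lower is better)."""
    return statistics.median(ranks)

# Made-up ranks for five queries.
ranks = [1, 3, 2, 10, 1]

r_at_1 = recall_at_k(ranks, 1)
r_at_5 = recall_at_k(ranks, 5)
med = median_rank(ranks)
print(r_at_1, r_at_5, med)
```

Higher Recall@1 means the correct image is more often the top result, while a lower median rank means that even the misses tend to land near the top of the list.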

This review was generated by AI and reviewed by a human editor.